TW202309771A - Apparatus and method for generating an entity-relation extraction model - Google Patents
- Publication number
- TW202309771A (application TW110131864A)
- Authority
- TW
- Taiwan
- Prior art keywords
- entity
- relationship
- information
- marked
- generating
Landscapes
- Machine Translation (AREA)
- Paper (AREA)
Abstract
Description
The present invention relates to a device and method for generating an entity-relation extraction model. Specifically, the present invention relates to a device and method that execute a pre-labeling procedure and a model-training procedure to generate an entity-relation extraction model.
Knowledge extraction is the first and most important step in knowledge management: it extracts useful knowledge, including entities and relations, from a large number of documents. With this knowledge, application services can make decisions quickly and accurately when they encounter a scenario that requires judgment, and can complete the task of that scenario. Many applications and solutions rely on knowledge from structured text information to perform specific functions, for example search engines, automatic navigation, intelligent question answering, recommendation systems, and dialogue robots. Making them more intelligent requires knowledge graphs and semantic knowledge bases, so entity-relation extraction is one of the key technologies for building a knowledge base.
Existing entity-relation extraction methods rely mainly on hand-crafted rule templates and syntactic structure analysis. Hand-crafted rule templates perform matching with template rules designed by domain experts. New templates must be designed for each new domain or data set, which is time-consuming, and the approach suits only small domains. In syntactic structure analysis, linguists analyze the syntactic rules and structure of a single language to build a syntax, split the structure of each input sentence, and identify the relations between entity nouns and verbs. Its drawbacks are the extremely high cost of full-sentence annotation and the inability to switch quickly between domains or languages. Whichever of these methods is used, experts or scholars must be involved, a great deal of manual labeling cost and time is consumed, and the method cannot be adapted quickly and flexibly to different domains.
In view of this, generating an entity-relation extraction model efficiently and automatically is a goal the industry urgently needs to work toward.
An object of the present invention is to provide a device for generating an entity-relation extraction model. The device comprises a storage and a processor, and the processor is electrically connected to the storage. The storage stores an entity-relation database, which comprises at least a plurality of pieces of entity information and a plurality of pieces of relation information. The processor executes a pre-labeling procedure and a model-training procedure. The pre-labeling procedure comprises the following steps. The processor receives a text to be labeled. Based on a plurality of fields in the text to be labeled and on the entity information and relation information in the entity-relation database, the processor generates at least one piece of entity information to be labeled and at least one piece of relation information to be labeled for each field. The processor labels the at least one piece of entity information to be labeled and the at least one piece of relation information to be labeled of each field according to an improved labeling format, generating at least one piece of labeled entity information and at least one piece of labeled relation information. The processor generates a plurality of combinations from the labeled entity information and the labeled relation information and stores them in the entity-relation database. The model-training procedure comprises the following step: taking a pre-trained language model as a basis, the processor inputs the combinations into the pre-trained language model to generate an entity-relation extraction model.
Another object of the present invention is to provide a method for generating an entity-relation extraction model. The method is used in a device for generating an entity-relation extraction model. The device comprises a storage and a processor, and the storage stores an entity-relation database comprising at least a plurality of pieces of entity information and a plurality of pieces of relation information. The method is executed by the processor and comprises executing a pre-labeling procedure and a model-training procedure. The pre-labeling procedure comprises the following steps: receiving a text to be labeled; based on a plurality of fields in the text to be labeled and on the entity information and relation information in the entity-relation database, generating at least one piece of entity information to be labeled and at least one piece of relation information to be labeled for each field; labeling the at least one piece of entity information to be labeled and the at least one piece of relation information to be labeled of each field according to an improved labeling format, to generate at least one piece of labeled entity information and at least one piece of labeled relation information; and generating a plurality of combinations from the labeled entity information and the labeled relation information and storing them in the entity-relation database. The model-training procedure comprises the following step: taking a pre-trained language model as a basis, inputting the combinations into the pre-trained language model to generate an entity-relation extraction model.
As the above description shows, conventional training of an entity-relation extraction model usually has to start from scratch and needs input data produced by extensive manual labeling or intervention to be effective. In contrast, the technology provided by the present invention for generating an entity-relation extraction model (comprising at least the device and the method) is built on a pre-trained model. Through the pre-labeling procedure it quickly labels input data and expands the entity-relation database, automatically producing a large amount of data without human intervention, so that the entity-relation extraction model can be trained quickly. In addition, the present invention uses the information in the improved labeling format to speed up training of the entity-relation extraction model. It thereby addresses the drawbacks of the prior art, in which entity-relation extraction requires the intervention of experts or scholars, consumes a great deal of manual labeling cost and time, and cannot be adapted quickly and flexibly to different domains.
The detailed technology and embodiments of the present invention are described below with reference to the drawings, so that a person having ordinary skill in the art to which the present invention pertains can understand the technical features of the claimed invention.
The device and method for generating an entity-relation extraction model provided by the present invention are explained below through embodiments. These embodiments are not intended to restrict the present invention to any environment, application, or manner described in them. The description of the embodiments therefore serves only to explain the present invention, not to limit its scope. In the following embodiments and drawings, elements not directly related to the present invention are omitted and not shown, and the dimensions of the elements and the size ratios between them are illustrative only and do not limit the scope of the present invention.
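The first embodiment of the present invention is a device 1 for generating an entity-relation extraction model, whose architecture is depicted schematically in FIG. 1. In this embodiment, the device 1 comprises a storage 11, a transceiver interface 13, and a processor 15, and the processor 15 is electrically connected to the storage 11 and the transceiver interface 13. The storage 11 may be a memory, a Universal Serial Bus (USB) disk, a hard disk, an optical disc, a flash drive, or any other storage medium or circuit with the same function known to a person having ordinary skill in the art. The transceiver interface 13 is an interface capable of receiving and transmitting data, or any other such interface known to a person having ordinary skill in the art; it may receive data from sources such as external devices, external web pages, and external applications. The processor 15 may be any of various processing units, a central processing unit (CPU), a microprocessor, or another computing device known to a person having ordinary skill in the art. In some embodiments, the device 1 may be, but is not limited to, an electronic device such as a mobile electronic device, a desktop computer, or a portable computer.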
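In this embodiment, the storage 11 stores an entity-relation database 400, which comprises at least a plurality of pieces of entity information and a plurality of pieces of relation information. For ease of understanding, FIG. 2 illustrates one form of the entity-relation database 400. As shown in FIG. 2, the entity-relation database 400 records the columns input data, entity 1, relation, entity 2, and confidence score. Taking the first record of the entity-relation database 400 in FIG. 2 as an example, the input data is "Tom was born in Honolulu, Hawaii", and for that input data entity 1 is "Tom", the relation is "was born in", entity 2 is "Honolulu", and the confidence score is "1.0".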
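In some embodiments, the entity-relation database 400 is generated by the processor 15 executing a crawler procedure and an entity-relation database construction procedure. The crawler procedure comprises the following steps. The processor 15 collects a plurality of knowledge-base data contents, each comprising a plurality of entry names and an entry text corresponding to each entry name. The processor 15 performs sentence segmentation on each entry text to generate input data. The entity-relation database construction procedure comprises the following steps. The processor 15 inputs the input data into an entity-relation extraction system to generate output data, the output data comprising a plurality of triples, each triple comprising a plurality of pieces of entity information, at least one piece of relation information, and a confidence score. Based on the confidence score, the processor 15 stores in the entity-relation database those triples in the output data whose confidence score exceeds a preset value.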
For example, in the crawler procedure the processor 15 may run a crawler program that fetches entry names (for example, databases related to a certain category) and entry texts (for example, articles related to a certain category) from data sources such as general knowledge bases (for example DBpedia, YAGO, Freebase, and Wikipedia), domain knowledge bases (for example a patent knowledge base or a manufacturing-terminology knowledge base), and standard entity-relation dataset knowledge bases (for example OPIEC and OIE2016). The processor 15 then segments each entry text into sentences, using the period as the segmentation rule, to generate a plurality of input data items, each a single sentence. Note that after the crawler program fetches the entry names and entry texts, the processor 15 may further perform a pre-processing operation on the entry texts, that is, data-cleaning operations such as extracting text paragraphs, removing HTML tags, removing duplicate sentences, and removing abnormal garbled content.
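The cleaning and sentence-segmentation step described above can be sketched roughly as follows. This is a minimal illustration, not the implementation of the invention: the function name `clean_and_split` is hypothetical, "garbled content" is approximated by dropping non-printable characters, and a plain split on periods will also break abbreviations and decimals.

```python
import html
import re


def clean_and_split(entry_text: str) -> list[str]:
    """Strip HTML tags, split on periods, and drop duplicate sentences.

    Hypothetical helper; the patent names these operations but not how
    they are implemented.
    """
    # Remove HTML tags, then decode entities such as &amp;.
    text = html.unescape(re.sub(r"<[^>]+>", " ", entry_text))
    # Drop non-printable characters as a rough stand-in for garbled content.
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    # Collapse runs of whitespace.
    text = re.sub(r"\s+", " ", text).strip()

    sentences: list[str] = []
    seen: set[str] = set()
    # Use the period (Western or CJK full stop) as the segmentation rule.
    for part in re.split(r"[.。]", text):
        sentence = part.strip()
        if sentence and sentence not in seen:
            seen.add(sentence)
            sentences.append(sentence)
    return sentences


page = "<p>Tom was born in Honolulu.</p><p>Tom was born in Honolulu. He lives in Hawaii.</p>"
print(clean_and_split(page))
# ['Tom was born in Honolulu', 'He lives in Hawaii']
```

A production crawler would likely replace the period split with a proper sentence tokenizer; the sketch only mirrors the rule stated in the text.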
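As another example, in the construction procedure of the entity-relation database 400, the processor 15 inputs the input data into an entity-relation extraction system, which may be an open-source entity-relation extraction tool already trained by machine learning, such as OpenIE5 or RnnOIE. Through the entity-relation extraction system, the processor 15 then extracts from the sentence-segmented entry texts (that is, the input data) a plurality of triples, each comprising a plurality of pieces of entity information, at least one piece of relation information, and a confidence score. As shown in FIG. 2, each triple comprises entity 1, a relation, entity 2, and a confidence score, where the confidence score represents the degree of confidence in the extraction result for that triple and may be produced automatically by the entity-relation extraction system. Finally, the processor 15 may set the preset value of the confidence score to 0.85 and store the triples whose confidence score is greater than 0.85 in the entity-relation database 400.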
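The confidence filter in the construction procedure can be illustrated with a short sketch. The `Triple` record and `filter_triples` function are hypothetical names; the columns follow FIG. 2 and the 0.85 threshold follows the example above. Only the first row is taken from the text; the other two rows are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One record of the entity-relation database (columns as in FIG. 2)."""
    input_data: str
    entity1: str
    relation: str
    entity2: str
    confidence: float


def filter_triples(triples, threshold=0.85):
    """Keep only triples whose confidence score is strictly above the threshold."""
    return [t for t in triples if t.confidence > threshold]


extracted = [
    Triple("Tom was born in Honolulu, Hawaii", "Tom", "was born in", "Honolulu", 1.0),
    # The two rows below are made-up extractor outputs, not from the patent.
    Triple("Amy works at a bank", "Amy", "works at", "a bank", 0.85),
    Triple("It rains often", "It", "rains", "often", 0.40),
]
database = filter_triples(extracted)
print([t.entity1 for t in database])  # ['Tom']
```

The comparison is strict because the text stores triples whose score is "greater than 0.85", so a triple scoring exactly 0.85 is dropped.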
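In some embodiments, the entity-relation database 400 may instead be generated by an external device; the processor 15 receives the entity-relation database 400 through the transceiver interface 13 and stores it in the storage 11. Note that FIG. 2 is provided only as a convenient illustration and is not intended to limit the scope of the present invention; in actual operation the entity-relation database 400 may also contain other columns (for example, data source).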
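The specific operation of the device 1 is described next; please refer to FIG. 1. In this embodiment, the processor 15 executes the pre-labeling procedure and the model-training procedure. First, in the pre-labeling procedure, the processor 15 receives a text to be labeled 133 through the transceiver interface 13. The text to be labeled 133 is an article that has not yet been labeled with entities and relations, for example an article of a certain category or an article related to the domain of the model being trained, and it is used subsequently to expand the data of the entity-relation database 400.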
In some embodiments, the processor 15 performs sentence segmentation on the text to be labeled 133 to generate a plurality of fields, each a single sentence. In some embodiments, the processor 15 performs a text pre-processing operation on the text to be labeled 133, that is, data-cleaning operations such as extracting text paragraphs, removing HTML tags, removing duplicate sentences, and removing abnormal garbled content.
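Next, based on the plurality of fields in the text to be labeled 133 and on the entity information and relation information in the entity-relation database 400, the processor 15 generates at least one piece of entity information to be labeled and at least one piece of relation information to be labeled for each field. Specifically, this may comprise the following steps. First, the processor 15 compares the fields in the text to be labeled 133 with the entity information in the entity-relation database 400 to generate the at least one piece of entity information to be labeled for each field. Then, the processor 15 compares each field that contains at least two pieces of entity information to be labeled with the relation information in the entity-relation database 400 to generate the at least one piece of relation information to be labeled for each such field.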
Subsequently, the processor 15 labels the at least one piece of entity information to be labeled and the at least one piece of relation information to be labeled of each field according to an improved labeling format, generating at least one piece of labeled entity information and at least one piece of labeled relation information. The processor 15 generates a plurality of combinations from the labeled entity information and the labeled relation information and stores them in the entity-relation database 400. In some embodiments, the processor 15 generates the combinations for each field according to the order in which the labeled entity information and the labeled relation information appear in that field. In some embodiments, the improved labeling format consists of a conventional sequence labeling format (for example BMES, BIO, or BIOES) together with an entity tag and a relation tag corresponding to that format.
For ease of understanding, a concrete example of the pre-labeling procedure follows; please refer to FIG. 1, FIG. 2, and FIG. 3. The example is not intended to limit the scope of the present invention. In this example, the text to be labeled 133 contains a field A consisting of the sentence "Wang was born in Taiwan, Tainan, Zhongshan street". First, the processor 15 compares field A with each piece of entity information in the entity-relation database 400 of FIG. 2 (that is, the entity 1 and entity 2 columns) to determine which words or phrases in field A are entities. Because "Wang", "Taiwan", "Tainan", and "Zhongshan street" in field A are already labeled as entities in records 4, 5, and 6 of the entity-relation database 400, the comparison yields the entity information to be labeled for field A: "Wang", "Taiwan", "Tainan", and "Zhongshan street" (in the order in which they appear in field A).
Next, the processor 15 determines which fields contain at least two pieces of entity information to be labeled, that is, which fields could form a combination from two entities and one relation; a field without two entities cannot form a combination even if it contains a relation. In this example, because field A has more than two pieces of entity information to be labeled, the processor 15 compares field A with each piece of relation information in the entity-relation database 400 of FIG. 2 (that is, the relation column) to determine which words or phrases in field A are relations. Because "was born in" is already labeled as a relation in records 1 and 2 of the entity-relation database 400, the comparison yields the relation information to be labeled for field A: "was born in".
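The two comparison steps for field A can be sketched as follows. This is an illustrative reading only: `find_spans` and `pre_label_candidates` are hypothetical names, whole-word regular-expression matching is an assumed way to implement the string comparison, and the lists of known entities and relations stand in for the corresponding columns of the entity-relation database.

```python
import re


def find_spans(field: str, phrases: list[str]) -> list[str]:
    """Return the known phrases found in the field as whole words, in order of appearance."""
    hits = []
    for phrase in set(phrases):
        match = re.search(r"\b" + re.escape(phrase) + r"\b", field)
        if match:
            hits.append((match.start(), phrase))
    return [phrase for _, phrase in sorted(hits)]


def pre_label_candidates(field, known_entities, known_relations):
    """Match entities first; look for relations only if at least two entities were found."""
    entities = find_spans(field, known_entities)
    relations = find_spans(field, known_relations) if len(entities) >= 2 else []
    return entities, relations


# Stand-ins for the entity and relation columns of the database (partly assumed).
known_entities = ["Tom", "Honolulu", "Wang", "Taiwan", "Tainan", "Zhongshan street"]
known_relations = ["was born in", "lives in"]

field_a = "Wang was born in Taiwan, Tainan, Zhongshan street"
print(pre_label_candidates(field_a, known_entities, known_relations))
# (['Wang', 'Taiwan', 'Tainan', 'Zhongshan street'], ['was born in'])
```

Skipping the relation comparison for fields with fewer than two entities follows the rule stated above, since such a field cannot yield an entity 1, relation, entity 2 combination.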
Subsequently, the processor 15 labels the at least one piece of entity information to be labeled and the at least one piece of relation information to be labeled of each field according to the improved labeling format. This example uses BMES labeling, in which B marks the first position of a word or phrase, M a middle position, E the last position, and S a single standalone word. For the entity information to be labeled in field A, the processor 15 labels "Wang" by adding the prefix Entity before the conventional sequence label S, giving "Wang [Entity-S]"; it labels "Taiwan" as "Taiwan [Entity-S]", "Tainan" as "Tainan [Entity-S]", and "Zhongshan street" as "Zhongshan [Entity-B] street [Entity-E]". The results "Wang [Entity-S]", "Taiwan [Entity-S]", "Tainan [Entity-S]", and "Zhongshan [Entity-B] street [Entity-E]" are the labeled entity information. The processor 15 labels the relation information to be labeled, "was born in", by adding the prefix Relation before the conventional sequence labels B, M, and E, giving "was [Relation-B] born [Relation-M] in [Relation-E]", which is the labeled relation information.
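The labeling rule in this example can be sketched in a few lines. The function names `tag_phrase` and `render` are hypothetical, and splitting phrases on whitespace is an assumption that suits the English example; the tags themselves follow the BMES scheme with the Entity or Relation prefix described above.

```python
def tag_phrase(phrase: str, kind: str) -> list[tuple[str, str]]:
    """Tag each word of a phrase with a BMES position prefixed by kind ("Entity" or "Relation")."""
    words = phrase.split()
    if len(words) == 1:
        # A standalone word gets the S tag.
        return [(words[0], f"{kind}-S")]
    # First word B, last word E, everything in between M.
    positions = ["B"] + ["M"] * (len(words) - 2) + ["E"]
    return [(word, f"{kind}-{pos}") for word, pos in zip(words, positions)]


def render(tagged: list[tuple[str, str]]) -> str:
    """Format tagged words the way the example writes them."""
    return " ".join(f"{word} [{tag}]" for word, tag in tagged)


print(render(tag_phrase("Wang", "Entity")))
# Wang [Entity-S]
print(render(tag_phrase("Zhongshan street", "Entity")))
# Zhongshan [Entity-B] street [Entity-E]
print(render(tag_phrase("was born in", "Relation")))
# was [Relation-B] born [Relation-M] in [Relation-E]
```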
Next, the processor 15 generates a plurality of combinations from the labeled entity information and the labeled relation information of each field, following the order in which they appear in the field, and stores the combinations in the entity-relation database 400. For field A, referring to FIG. 3, the processor 15 starts from the labeled entity information "Wang [Entity-S]", "Taiwan [Entity-S]", "Tainan [Entity-S]", and "Zhongshan [Entity-B] street [Entity-E]" and the labeled relation information "was [Relation-B] born [Relation-M] in [Relation-E]", and generates the combinations "Wang was born in Taiwan", "Wang was born in Tainan", and "Wang was born in Zhongshan street", each arranged as entity 1, relation, entity 2. It stores them as records 7, 8, and 9 of the entity-relation database 400 in FIG. 3 (in this example, the confidence score of each generated combination is preset to 1).
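One possible reading of the ordering rule is sketched below: for each relation, every entity that appears before it is paired with every entity that appears after it. The patent only states that combinations follow the order of appearance, so this pairing rule and the name `build_combinations` are assumptions; the sketch does reproduce the three combinations given for field A.

```python
def build_combinations(field, entities, relations, confidence=1.0):
    """Pair entities before a relation with entities after it (assumed ordering rule)."""
    combos = []
    for relation in relations:
        start = field.find(relation)
        if start < 0:
            continue  # relation does not occur in this field
        end = start + len(relation)
        before = [e for e in entities if 0 <= field.find(e) < start]
        after = [e for e in entities if field.find(e) >= end]
        for entity1 in before:
            for entity2 in after:
                combos.append((entity1, relation, entity2, confidence))
    return combos


field_a = "Wang was born in Taiwan, Tainan, Zhongshan street"
entities_a = ["Wang", "Taiwan", "Tainan", "Zhongshan street"]
for combo in build_combinations(field_a, entities_a, ["was born in"]):
    print(combo)
# ('Wang', 'was born in', 'Taiwan', 1.0)
# ('Wang', 'was born in', 'Tainan', 1.0)
# ('Wang', 'was born in', 'Zhongshan street', 1.0)
```

The default confidence of 1.0 mirrors the preset score in the example; a real system might prefer a lower value for automatically generated records.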
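Accordingly, the processor 15 can perform the same operations on every field contained in the text to be labeled 133. The processor 15 labels automatically by string comparison against the entity-relation database 400 and can produce several combinations per original field, thereby expanding the data content of the entity-relation database 400. Moreover, because the processor 15 labels the entities and relations in each field in the improved labeling format, the data content of the entity-relation database 400 carries positional feature information in addition to the entity and relation information, which benefits the efficiency and duration of subsequent model training.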
In some embodiments, the processor 15 may also generate combinations using other arrangements; a person having ordinary skill in the art can understand from the foregoing description how to do so, and the details are not repeated here. Note that, for brevity, the improved labeling format illustrated in FIG. 3 shows only part of the content; this is not intended to limit the scope of the present invention, and a person having ordinary skill in the art can understand the operation from the foregoing description.
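The steps by which the processor 15 executes the model-training procedure are described below; please refer to FIG. 4. In this embodiment, taking a pre-trained language model 413 as a basis, the processor 15 inputs the combinations into the pre-trained language model 413 to generate an entity-relation extraction model, which is used to identify the entity information and the relation information in a text passage. The pre-trained language model 413 comprises at least one already-trained language layer model. Because its multi-layer network structure has been trained on a large amount of text, the language layer already contains many parameters with trained weights. An example is BERT (Bidirectional Encoder Representations from Transformers), the pre-trained language model proposed by Google, in which each Transformer is a model that uses a self-attention mechanism to strengthen attention to the associations within a sequence.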
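Specifically, the model-training procedure may comprise the following steps. First, as shown in FIG. 4, the processor 15 connects an input layer 411 and a sequence layer 415 in series with the pre-trained language model 413 to effectively reduce the complexity of model training. The input layer 411 splits the fields into a plurality of tokens that serve as the input of the pre-trained language model 413, and the sequence layer 415 performs an analysis operation based on the improved labeling format to produce the entity information and the relation information in the text passage. Then, the processor 15 inputs the combinations in the entity-relation database 400 into the input layer 411, which works together with the pre-trained language model 413 and the sequence layer 415 to generate the entity-relation extraction model.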
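Note that the input layer 411 takes a plurality of text sequences (that is, the input data in the entity-relation database 400), splits each text sequence into a sequence of tokens, and feeds the token sequences to the pre-trained language model 413 (that is, the BERT layer). The sequence layer 415 receives the output of the pre-trained language model 413 and finally produces, for each text sequence, the entity and relation labels in the conventional sequence labeling format (for example BMES, BIO, or BIOES). The sequence layer (CRF layer) can add constraints on the sequence labels (that is, restrict what may be produced for the next token), which ensures that the predicted labels are valid and effectively reduces the complexity of model training. Connecting the sequence layer after the language layer (that is, the BERT layer) therefore strengthens the sequence analysis. For brevity, FIG. 4 shows only part of the content; a person having ordinary skill in the art can understand from the foregoing description how neural networks are connected in series for machine-learning training, and the details are not repeated here.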
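The kind of constraint a sequence layer can impose on labels may be illustrated with a small validity check for the prefixed BMES tags. This is not a CRF and not the patent's sequence layer: a trained CRF scores label transitions, whereas the sketch hard-codes which transitions are legal. The "O" label for tokens outside any entity or relation is an assumption, as are the function names.

```python
def parse(label: str) -> tuple:
    """Split "Entity-B" into ("Entity", "B"); the assumed outside label "O" has no kind."""
    if label == "O":
        return None, "O"
    kind, position = label.split("-")
    return kind, position


def allowed(prev: str, curr: str) -> bool:
    """Return True if label curr may directly follow label prev."""
    prev_kind, prev_pos = parse(prev)
    curr_kind, curr_pos = parse(curr)
    if prev_pos in ("B", "M"):
        # An open span must continue with M or E of the same kind.
        return curr_pos in ("M", "E") and curr_kind == prev_kind
    # After O, E, or S the span is closed: only O or a new span may follow.
    return curr_pos in ("O", "B", "S")


def is_valid_sequence(labels: list[str]) -> bool:
    """Check a whole label sequence, treating both ends as bordered by "O"."""
    padded = ["O"] + list(labels) + ["O"]
    return all(allowed(a, b) for a, b in zip(padded, padded[1:]))


good = ["Entity-S", "Relation-B", "Relation-M", "Relation-E", "Entity-S"]
bad = ["Entity-B", "Relation-E"]
print(is_valid_sequence(good), is_valid_sequence(bad))  # True False
```

In a CRF-based tagger, illegal transitions like these are typically given very low scores or masked out during decoding, which is one way the predicted labels are kept well-formed.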
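In some embodiments, as shown in FIG. 4, machine learning is performed by a neural network 409 formed by connecting three networks in series: the input layer 411, the pre-trained language model 413, and the sequence layer 415. The pre-trained language model 413 is fine-tuned on the data of the entity-relation database 400 to train the entity-relation extraction model. The input of the entity-relation extraction model is a text sequence together with its labeling information, and the trained model can predict which words in a new text sequence are entities and which are relations.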
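As the above description shows, the device 1 executes a pre-labeling procedure and a model-training procedure. In the pre-labeling procedure, based on the plurality of fields in the text to be labeled 133 and on the entity information and relation information in the entity-relation database 400, the processor 15 generates at least one piece of entity information to be labeled and at least one piece of relation information to be labeled for each field. It labels them according to the improved labeling format to generate at least one piece of labeled entity information and at least one piece of labeled relation information, then generates a plurality of combinations from the labeled entity information and the labeled relation information and stores them in the entity-relation database 400. In the model-training procedure, taking a pre-trained language model as a basis, the processor 15 inputs the combinations into the pre-trained language model to generate an entity-relation extraction model.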
As the above description shows, conventional training of an entity-relation extraction model usually has to start from scratch and needs input data produced by extensive manual labeling or intervention to be effective. In contrast, the device provided by the present invention for generating an entity-relation extraction model is built on a pre-trained model. Through the pre-labeling procedure it quickly labels input data and expands the entity-relation database, automatically producing a large amount of data without human intervention, so that the entity-relation extraction model can be trained quickly. In addition, the present invention uses the information in the improved labeling format to speed up training of the entity-relation extraction model. It thereby addresses the drawbacks of the prior art, in which entity-relation extraction requires the intervention of experts or scholars, consumes a great deal of manual labeling cost and time, and cannot be adapted quickly and flexibly to different domains.
The second embodiment of the present invention is a method for generating an entity-relation extraction model, whose flowchart is depicted in FIG. 5. The method is used in a device for generating an entity-relation extraction model (hereinafter "the device"), for example the device 1 described in the first embodiment. The device comprises a storage, a transceiver interface, and a processor. The storage stores an entity-relation database, for example the entity-relation database 400 described in the first embodiment, which comprises at least a plurality of pieces of entity information and a plurality of pieces of relation information. The method generates the entity-relation extraction model through steps S501 to S507 of the pre-labeling procedure and step S509 of the model-training procedure.
In some embodiments, the entity-relation database is generated by a crawler procedure and an entity-relation database construction procedure. The crawler procedure comprises the following steps: collecting a plurality of knowledge-base data contents, each comprising a plurality of entry names and an entry text corresponding to each entry name; and performing sentence segmentation on each entry text to generate input data. The entity-relation database construction procedure comprises the following steps: inputting the input data into an entity-relation extraction system to generate output data, the output data comprising a plurality of triples, each triple comprising a plurality of pieces of entity information, at least one piece of relation information, and a confidence score; and, based on the confidence score, storing in the entity-relation database those triples in the output data whose confidence score exceeds a preset value.
Steps S501 to S507 of the pre-labeling procedure are described first. In step S501, the device receives a text to be labeled.
Next, in step S503, based on a plurality of fields in the text to be labeled and on the entity information and relation information in the entity-relation database, the device generates at least one piece of entity information to be labeled and at least one piece of relation information to be labeled for each field. In some embodiments, this comprises the following steps: comparing the fields in the text to be labeled with the entity information in the entity-relation database to generate the at least one piece of entity information to be labeled for each field; and comparing each field that contains at least two pieces of entity information to be labeled with the relation information in the entity-relation database to generate the at least one piece of relation information to be labeled for each such field.
Subsequently, in step S505, the device labels the at least one piece of entity information to be labeled and the at least one piece of relation information to be labeled of each field according to an improved labeling format, to generate at least one piece of labeled entity information and at least one piece of labeled relation information. In some embodiments, the improved labeling format consists of a conventional sequence labeling format together with an entity tag and a relation tag corresponding to that format.
Next, in step S507, the device generates a plurality of combinations from the at least one piece of labeled entity information and the at least one piece of labeled relation information and stores them in the entity-relation database. In some embodiments, generating the combinations comprises the following step: generating the combinations of the labeled entity information and the labeled relation information in each field according to the order in which they appear in that field.
Step S509 of the model-training procedure is described next. In step S509, taking a pre-trained language model as a basis, the device inputs the combinations into the pre-trained language model to generate an entity-relation extraction model.
In some embodiments, the model-training procedure further comprises: connecting an input layer and a sequence layer in series with the pre-trained language model to effectively reduce the complexity of model training, wherein the input layer splits the fields into a plurality of tokens that serve as the input of the pre-trained language model, and the sequence layer performs an analysis operation based on the improved labeling format to produce the entity information and the relation information in the text passage; and inputting the combinations in the entity-relation database, which carry the improved labeling format, into the input layer, which works together with the pre-trained language model and the sequence layer to generate the entity-relation extraction model.
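In addition to the above steps, the second embodiment can also perform all the operations and steps of the device 1 described in the first embodiment, with the same functions and the same technical effects. A person having ordinary skill in the art can directly understand how the second embodiment performs these operations and steps based on the first embodiment, so the details are not repeated.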
In summary, conventional training of an entity-relation extraction model usually has to start from scratch and needs input data produced by extensive manual labeling or intervention to be effective. In contrast, the method provided by the present invention for generating an entity-relation extraction model is built on a pre-trained model. Through the pre-labeling procedure it quickly labels input data and expands the entity-relation database, automatically producing a large amount of data without human intervention, so that the entity-relation extraction model can be trained quickly. In addition, the present invention uses the information in the improved labeling format to speed up training of the entity-relation extraction model. It thereby addresses the drawbacks of the prior art, in which entity-relation extraction requires the intervention of experts or scholars, consumes a great deal of manual labeling cost and time, and cannot be adapted quickly and flexibly to different domains.
The above embodiments merely exemplify some implementations of the present invention and explain its technical features; they are not intended to limit the scope of protection of the present invention. Any change or equivalent arrangement that can be easily accomplished by a person having ordinary skill in the art to which the present invention pertains falls within the scope claimed by the present invention, and the scope of protection of the present invention is defined by the claims.
1: Device for generating an entity relationship extraction model
11: Storage
13: Transceiver interface
15: Processor
133: Text to be labeled
400: Entity relationship database
409: Neural network
411: Input layer
413: Pre-trained language model
415: Sequence layer
S501-S509: Steps
FIG. 1 is a schematic diagram of the architecture of a device for generating an entity relationship extraction model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the entity relationship database in the first embodiment;
FIG. 3 is a schematic diagram of the expanded entity relationship database in the first embodiment;
FIG. 4 is a schematic diagram of the architecture for training the entity relationship extraction model in the first embodiment; and
FIG. 5 is a flowchart of the method for generating an entity relationship extraction model according to the second embodiment.
Claims (12)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW110131864A TWI807400B (en) | 2021-08-27 | 2021-08-27 | Apparatus and method for generating an entity-relation extraction model |
Publications (2)
Publication Number | Publication Date |
---|---|
TW202309771A (en) | 2023-03-01 |
TWI807400B (en) | 2023-07-01 |
Family
ID=86690780
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI807400B (en) |