TWI414950B - System and method for parsing texts - Google Patents

System and method for parsing texts

Publication number
TWI414950B
Authority
TW
Taiwan
Application number
TW97104022A
Other languages
Chinese (zh)
Other versions
TW200935252A (en)
Inventor
Chung I Lee
Chien Fa Yeh
Chiu Hua Lu
Xiao-Di Fan
Xiao-Ping Zhang
Original Assignee
Hon Hai Prec Ind Co Ltd
Priority date
2008-02-01 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2008-02-01
Publication date
2013-11-11
Application filed by Hon Hai Prec Ind Co Ltd
Priority to TW97104022A
Publication of TW200935252A
Application granted
Publication of TWI414950B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

A method for parsing texts is disclosed. The method includes steps of: creating a text-described file; loading the text-described file and the text to be parsed; parsing the text according to the text-described file and extracting data from the text; outputting the data. A related system is also disclosed.

Description

Text parsing system and method

The present invention relates to a text processing system and method, and in particular to a text parsing system and method.

Information systems usually convey information through text. Text is platform-independent, portable, and broadly compatible, and its readability suits the way people take in information visually. The purpose of text parsing is to decompose a text and extract the data it contains, either for secondary processing or for deeper processing and services such as retrieval and text mining. Consider the electronic report shown in FIG. 1. The report contains multiple forms (only Form 1 and Form 2 are shown in FIG. 1). Each form is an independent part consisting of a title, a source, a tabulation date, a vendor number, a vendor name, and a data table. The data table contains multiple records, and each record includes a receipt number, a receipt date, a part number, a purchase order number, a received quantity, and a received unit price. To analyze and compile statistics on electronic reports of this kind, the data in the report (in FIG. 1, the title, source, tabulation date, vendor number, and vendor name of each form, and the receipt number, receipt date, part number, purchase order number, received quantity, and received unit price of each record) must be extracted and reorganized, according to its meaning, into a data structure that the information system can recognize. Conventional text decomposition is difficult to standardize and must be custom-coded by programmers, which is error-prone. Parsing becomes especially difficult when the text format is complex and the data volume is large.

In view of the above, there is a need for a general-purpose text parsing system and method.

A text parsing system comprises: a loading module for loading a text description definition document and the text to be parsed, where the text description definition document defines matching rules for each block of the text and for each data item, the blocks form a multi-level tree structure, the top level is the root block, and the bottom level consists of minimum unit blocks; a parsing module for matching the blocks of the text, starting from the root block, according to the matching rule of each block, where, if a matched block contains sub-blocks, the sub-blocks are matched within that matched block according to their matching rules until all blocks have been matched, and for extracting each data item from each minimum unit block according to the matching rule of the corresponding data item; and an output module for outputting the extracted data in the document format required by the user.

A text parsing method comprises the steps of: creating a text description definition document, where the text description definition document defines matching rules for each block and each data item, the blocks form a multi-level tree structure, the top level is the root block, and the bottom level consists of minimum unit blocks; loading the text description definition document and the text to be parsed; starting from the root block, matching the blocks of the text according to the matching rule of each block, where, if a matched block contains sub-blocks, the sub-blocks are matched within that matched block according to their matching rules until all blocks have been matched, and extracting each data item from each minimum unit block according to the matching rule of the corresponding data item; and outputting the extracted data in the document format required by the user.

Compared with the prior art, the present invention combines Extensible Markup Language with regular expressions to describe the structure and matching rules of a text, decomposes the text according to the described structure and matching rules, and extracts data from it, thereby making it possible to parse many types of text.

11‧‧‧Loading module

12‧‧‧Parsing module

13‧‧‧Output module

S301‧‧‧Create the text description definition document

S302‧‧‧Load the text description definition document

S303‧‧‧Load the text

S304‧‧‧Extract the data

S305‧‧‧Output the data

FIG. 1 is a schematic diagram of an electronic report.

FIG. 2 is a functional module diagram of a preferred embodiment of the text parsing system of the present invention.

FIG. 3 is a flowchart of a preferred embodiment of the text parsing method of the present invention.

FIG. 2 is a functional module diagram of a preferred embodiment of the text parsing system of the present invention. The system includes a loading module 11, a parsing module 12, and an output module 13. The text parsing system runs on a computer. It decomposes a text according to a text description definition document, extracts data from the text, and outputs the extracted data. The data is the specific information that needs to be extracted from the text. Taking the electronic report in FIG. 1 as an example, the data to be extracted includes, for each form, the title (e.g. "*** Co., Ltd. Purchase Order"), the source (e.g. H5S00001), the tabulation date (e.g. 20070601 17:32:03), the vendor number (e.g. 9876543210), and the vendor name (e.g. "*** Electronics Company"), and, for each record, the receipt number (e.g. HaA-012345), the receipt date (e.g. 20070512), the part number (e.g. 987654J00-001-BB), the purchase order number (e.g. Ord-111111), the received quantity (e.g. 2,400.00), and the received unit price (e.g. 12.45000).

The text description definition document describes the structure of the text and its matching rules. In this embodiment, the text description definition document is an Extensible Markup Language (XML) document, i.e. *.xml. It defines blocks arranged in a multi-level tree structure to describe the structure of the text. The top-level block is the root block. The root block contains a number of blocks, and each of those may in turn contain further blocks, so that a multi-level tree structure results. Every non-root block (called a sub-block) is directly contained in exactly one block. If a block contains multiple sub-blocks of identical structure, that sub-block is defined as a list type. A bottom-level block contains no sub-blocks; it is called a minimum unit block and contains only data. For example, for the electronic report in FIG. 1, root denotes the root block and corresponds to the entire report. table denotes the sub-block of the root block; it is defined as a list type and corresponds to the individual forms in the report. table contains the sub-blocks title, from, date, supplierId, supplierName, and form, which represent the title, source, tabulation date, vendor number, vendor name, and data table of each form. form contains the sub-block item, which is defined as a list type and corresponds to the individual records in the data table. item contains the sub-blocks consignId, consignDate, productId, PoId, inAmount, and price, which represent the receipt number, receipt date, part number, purchase order number, received quantity, and received unit price of each record. Of these, title, from, date, supplierId, supplierName, consignId, consignDate, productId, PoId, inAmount, and price are minimum unit blocks, and these blocks contain data. For example, in Form 1 of FIG. 1, title is "*** Co., Ltd. Purchase Order", from is "H5S00001", date is "20070601 17:32:03", supplierId is "9876543210", and supplierName is "*** Electronics Company"; in the first record of Form 1, consignId is "HaA-012345", consignDate is "20070512", productId is "987654J00-001-BB", PoId is "Ord-111111", inAmount is "2,400.00", and price is "12.45000". The purpose of parsing a text such as the electronic report in FIG. 1 is to extract the data it contains.
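As a rough illustration, the block hierarchy described above could be written as an XML definition and loaded into a tree. The patent does not give the actual schema, so the element name `block` and the attributes `name` and `list` below are assumptions made for this sketch only; the block names themselves come from the description.

```python
import xml.etree.ElementTree as ET

# Hypothetical definition document. The <block> element and the name/list
# attributes are illustrative; only the block names come from the description.
DEFINITION_XML = """
<block name="root">
  <block name="table" list="true">
    <block name="title"/>
    <block name="from"/>
    <block name="date"/>
    <block name="supplierId"/>
    <block name="supplierName"/>
    <block name="form">
      <block name="item" list="true">
        <block name="consignId"/>
        <block name="consignDate"/>
        <block name="productId"/>
        <block name="PoId"/>
        <block name="inAmount"/>
        <block name="price"/>
      </block>
    </block>
  </block>
</block>
"""


def leaf_names(node):
    """Return the names of minimum unit blocks (blocks with no sub-blocks)."""
    children = list(node)
    if not children:
        return [node.get("name")]
    names = []
    for child in children:
        names.extend(leaf_names(child))
    return names


root = ET.fromstring(DEFINITION_XML)
leaves = leaf_names(root)
list_blocks = [b.get("name") for b in root.iter("block") if b.get("list") == "true"]
```

Walking the tree this way recovers the eleven minimum unit blocks and the two list-type blocks (`table` and `item`) named in the description.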

The text description definition document uses regular expressions (regexes) to describe the matching rules of each block and each data item. Using these matching rules, the individual blocks can be separated out of a text such as the electronic report shown in FIG. 1, and the data can be extracted from the minimum unit blocks.
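A minimal sketch of what a regular-expression matching rule for one record might look like follows. FIG. 1 is not reproduced in the text, so the whitespace-separated layout of `record_line` is an assumption, and `ITEM_RULE` is a hypothetical rule rather than one taken from the patent. The field values are the sample values quoted in the description.

```python
import re

# Assumed layout of one record line (columns separated by whitespace).
record_line = "HaA-012345  20070512  987654J00-001-BB  Ord-111111  2,400.00  12.45000"

# Hypothetical matching rule: one named group per minimum unit block of `item`.
ITEM_RULE = re.compile(
    r"(?P<consignId>\S+)\s+(?P<consignDate>\d{8})\s+(?P<productId>\S+)\s+"
    r"(?P<PoId>\S+)\s+(?P<inAmount>[\d,]+\.\d+)\s+(?P<price>\d+\.\d+)"
)

match = ITEM_RULE.search(record_line)
record = match.groupdict() if match else {}
```

Named groups keep the rule and the block names aligned, so the extracted dictionary can be keyed directly by `consignId`, `consignDate`, and so on.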

The loading module 11 loads the text description definition document and the text to be parsed. In this embodiment, the text is loaded into an array line by line, with one array element per line of text. Suppose the electronic report in FIG. 1 has 50 lines and is loaded into the array string. The elements of the array are then string[0], string[1], string[2], ..., string[48], string[49], corresponding to lines 1, 2, 3, ..., 49, 50 of the report.
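The line-by-line loading step can be sketched in a few lines. The function name `load_lines` and the sample text are hypothetical; the only point illustrated is that a zero-based array index `i` holds line `i + 1` of the text.

```python
def load_lines(text):
    """Split a text into a list with one element per line (index i = line i + 1)."""
    return text.splitlines()


# Hypothetical three-line text standing in for the report.
sample = "REPORT HEADER\nPurchase Order\nH5S00001\n"
lines = load_lines(sample)
second_line = lines[1]  # index 1 holds line 2 of the text
```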

The parsing module 12 matches the blocks of the text according to the matching rules of the blocks in the text description definition document and extracts data from the minimum unit blocks. When a block is matched, a list of blocks is matched if the block is a list type; otherwise, the first block that satisfies the matching rule is matched. The parsing module 12 matches the blocks as follows. Matching starts from the root block. If a matched block contains sub-blocks, its sub-blocks are matched within that block according to the text description definition document, and this continues until all blocks have been matched and the data has been extracted from the minimum unit blocks. The text description definition document defines a matching rule for each block; searching the text for the sub-text (a portion of the text) that satisfies a block's matching rule yields that block of the text. For example, searching the text for the sub-text that satisfies the root block's matching rule yields the root block of the text. Note that in this embodiment the text is loaded into an array line by line, so matching is performed on the array and each matched block is expressed in array elements. For example, the root block may be matched as string[1]~string[49], i.e. lines 2 to 50 of the text. Taking the electronic report in FIG. 1 as an example, matching first yields the root block root as string[1]~string[49]. Matching the sub-blocks of the root block within string[1]~string[49] yields the sub-block list table[0] (Form 1) and table[1] (Form 2), which are string[1]~string[24] and string[26]~string[49] respectively. Matching continues with the sub-blocks of table[0] and table[1]. For example, within string[1]~string[24] the sub-blocks title, from, date, supplierId, supplierName, and form of table[0] are found to be string[1], string[2], string[3], string[5], string[5], and string[9]~string[24] respectively, giving title "*** Co., Ltd. Purchase Order", from "H5S00001", date "20070601 17:32:03", supplierId "9876543210", and supplierName "*** Electronics Company". The form block contains the item sub-block, and item is a list type, so matching continues within the form block (for example string[9]~string[24]) to obtain the item sub-block list item[0], item[1], item[2], item[3], item[4], item[5], found for example at string[10], string[12], string[14], string[16], string[18], and string[20] respectively. Matching then continues with the sub-blocks of each item. For example, matching the sub-blocks of item[0] within string[10] yields consignId "HaA-012345", consignDate "20070512", productId "987654J00-001-BB", PoId "Ord-111111", inAmount "2,400.00", and price "12.45000".
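The top-down matching described above can be sketched as a small recursive function over the line array. This is only an illustration under stated assumptions: the `Block` class, the rule that a block runs from a line matching its pattern to the next such line (or the end of its parent), the sample lines, and the reduced tree (the `form` level and most fields are omitted for brevity) are all hypothetical and not taken from the patent.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Block:
    name: str
    pattern: str                 # regex; for a minimum unit block, group 1 is the data
    is_list: bool = False
    children: list = field(default_factory=list)


def find_spans(block, lines, lo, hi):
    """Return (start, end) index ranges for `block` inside lines[lo:hi].

    Assumed rule: a block starts at a line matching its pattern and ends at
    the next such line or at the end of the parent block.
    """
    starts = [i for i in range(lo, hi) if re.search(block.pattern, lines[i])]
    if not block.is_list:
        starts = starts[:1]      # non-list block: first match only
    return [(s, starts[k + 1] if k + 1 < len(starts) else hi)
            for k, s in enumerate(starts)]


def parse(block, lines, lo, hi):
    """Match sub-blocks inside the parent's range, recursing down to the data."""
    if not block.children:       # minimum unit block: extract the data
        m = re.search(block.pattern, lines[lo])
        return m.group(1) if m else None
    result = {}
    for child in block.children:
        values = [parse(child, lines, s, e) for s, e in find_spans(child, lines, lo, hi)]
        if child.is_list:
            result[child.name] = values
        elif values:
            result[child.name] = values[0]
    return result


# Hypothetical report lines; FIG. 1 is not reproduced in the text.
lines = [
    "REPORT",
    "Title: Purchase Order",
    "From: H5S00001",
    "Item HaA-012345 2,400.00",
    "Item HaA-012346 100.00",
    "Title: Purchase Order 2",
    "From: H5S00002",
    "Item HaA-012347 5.00",
]

item = Block("item", r"^Item\b", is_list=True, children=[
    Block("consignId", r"^Item\s+(\S+)"),
    Block("inAmount", r"([\d,]+\.\d+)\s*$"),
])
table = Block("table", r"^Title:", is_list=True, children=[
    Block("title", r"^Title:\s*(.+)$"),
    Block("from", r"^From:\s*(\S+)"),
    item,
])
root = Block("root", "", children=[table])

result = parse(root, lines, 0, len(lines))
```

Because each sub-block is searched only within the index range of its parent, a rule such as `^Item\b` cannot pick up records that belong to a different form.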

The output module 13 outputs the extracted data, organized according to the text description definition document, in the document format required by the user. In this embodiment, the extracted data is output as an XML document. For example, when the electronic report in FIG. 1 is parsed, the extracted data includes the title, from, date, supplierId, and supplierName of each form and the consignId, consignDate, productId, PoId, inAmount, and price of each record in the data table, and this data is output as the document output.xml, following the structure of the text. In other embodiments, the output module 13 may instead store the extracted data by field in a database, for example in an Excel table.
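One possible way to emit extracted data as XML that mirrors the block structure is sketched below. The helper `to_xml` and the nested dict/list shape of `extracted` are assumptions for illustration; the patent states only that the data is output as an XML document following the structure of the text.

```python
import xml.etree.ElementTree as ET


def to_xml(name, value):
    """Build an element tree from nested dicts (blocks) and lists (list-type blocks)."""
    elem = ET.Element(name)
    if isinstance(value, dict):
        for key, child in value.items():
            if isinstance(child, list):
                for entry in child:          # one element per list entry
                    elem.append(to_xml(key, entry))
            else:
                elem.append(to_xml(key, child))
    else:
        elem.text = str(value)               # minimum unit block: plain data
    return elem


# Hypothetical extraction result using sample values from the description.
extracted = {
    "table": [{
        "title": "Purchase Order",
        "from": "H5S00001",
        "item": [{"consignId": "HaA-012345", "price": "12.45000"}],
    }]
}

xml_text = ET.tostring(to_xml("root", extracted), encoding="unicode")
```

Writing `xml_text` to a file such as output.xml would complete the step; the file write is left out here to keep the sketch free of side effects.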

FIG. 3 is a flowchart of a preferred embodiment of the text parsing method of the present invention.

In step S301, a text description definition document is created. The text description definition document describes the structure of the text and its matching rules. In this embodiment, the text description definition document is an XML document, i.e. *.xml.

The text description definition document defines blocks arranged in a multi-level tree structure to describe the structure of the text. The top-level block is the root block. The root block contains a number of blocks, and each of those may in turn contain further blocks, so that a multi-level tree structure results. Every non-root block (called a sub-block) is directly contained in exactly one block. If a block contains multiple sub-blocks of identical structure, that sub-block is defined as a list type. A bottom-level block contains no sub-blocks; it is called a minimum unit block and contains only data. For example, for the electronic report in FIG. 1, root denotes the root block and corresponds to the entire report. table denotes the sub-block of the root block; it is defined as a list type and corresponds to the individual forms in the report. table contains the sub-blocks title, from, date, supplierId, supplierName, and form, which represent the title, source, tabulation date, vendor number, vendor name, and data table of each form. form contains the sub-block item, which is defined as a list type and corresponds to the individual records in the data table. item contains the sub-blocks consignId, consignDate, productId, PoId, inAmount, and price, which represent the receipt number, receipt date, part number, purchase order number, received quantity, and received unit price of each record. Of these, title, from, date, supplierId, supplierName, consignId, consignDate, productId, PoId, inAmount, and price are minimum unit blocks, and these blocks contain data. For example, in Form 1 of FIG. 1, title is "*** Co., Ltd. Purchase Order", from is "H5S00001", date is "20070601 17:32:03", supplierId is "9876543210", and supplierName is "*** Electronics Company"; in the first record of Form 1, consignId is "HaA-012345", consignDate is "20070512", productId is "987654J00-001-BB", PoId is "Ord-111111", inAmount is "2,400.00", and price is "12.45000". The purpose of parsing a text such as the electronic report in FIG. 1 is to extract the data it contains.

The text description definition document uses regular expressions to describe the matching rules of each block and each data item. Using these matching rules, the individual blocks can be separated out of a text such as the electronic report shown in FIG. 1, and the data can be extracted from the minimum unit blocks.

In step S302, the loading module 11 loads the text description definition document, which in this embodiment is a *.xml file.

In step S303, the loading module 11 loads the text. In this embodiment, the text is loaded into an array line by line, with one array element per line of text. Suppose the electronic report in FIG. 1 has 50 lines and is loaded into the array string. The elements of the array are then string[0], string[1], string[2], ..., string[48], string[49], corresponding to lines 1, 2, 3, ..., 49, 50 of the report.

In step S304, the parsing module 12 matches the blocks of the text according to the matching rules of the blocks in the text description definition document and extracts data from the minimum unit blocks. When a block is matched, a list of blocks is matched if the block is a list type; otherwise, the first block that satisfies the matching rule is matched. This step is carried out as follows. Matching starts from the root block. If a matched block contains sub-blocks, its sub-blocks are matched within that block according to the text description definition document, and this continues until all blocks have been matched and the data has been extracted from the minimum unit blocks. The text description definition document defines a matching rule for each block; searching the text for the sub-text (a portion of the text) that satisfies a block's matching rule yields that block of the text. For example, searching the text for the sub-text that satisfies the root block's matching rule yields the root block of the text. Note that in this embodiment the text is loaded into an array line by line, so matching is performed on the array and each matched block is expressed in array elements. For example, the root block may be matched as string[1]~string[49], i.e. lines 2 to 50 of the text. Taking the electronic report in FIG. 1 as an example, matching first yields the root block root as string[1]~string[49]. Matching the sub-blocks of the root block within string[1]~string[49] yields the sub-block list table[0] (Form 1) and table[1] (Form 2), which are string[1]~string[24] and string[26]~string[49] respectively. Matching continues with the sub-blocks of table[0] and table[1]. For example, within string[1]~string[24] the sub-blocks title, from, date, supplierId, supplierName, and form of table[0] are found to be string[1], string[2], string[3], string[5], string[5], and string[9]~string[24] respectively, giving title "*** Co., Ltd. Purchase Order", from "H5S00001", date "20070601 17:32:03", supplierId "9876543210", and supplierName "*** Electronics Company". The form block contains the item sub-block, and item is a list type, so matching continues within the form block (for example string[9]~string[24]) to obtain the item sub-block list item[0], item[1], item[2], item[3], item[4], item[5], found for example at string[10], string[12], string[14], string[16], string[18], and string[20] respectively. Matching then continues with the sub-blocks of each item. For example, matching the sub-blocks of item[0] within string[10] yields consignId "HaA-012345", consignDate "20070512", productId "987654J00-001-BB", PoId "Ord-111111", inAmount "2,400.00", and price "12.45000".

In step S305, the output module 13 outputs the extracted data, organized according to the text description definition document, in the document format required by the user. In this embodiment, the extracted data is output as an XML document. For example, when the electronic report in FIG. 1 is parsed, the extracted data includes the title, from, date, supplierId, and supplierName of each form and the consignId, consignDate, productId, PoId, inAmount, and price of each record in the data table, and this data is output as the document output.xml, following the structure of the text. In other embodiments, the output module 13 may instead store the extracted data by field in a database, for example in an Excel table.
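For the alternative of storing the data by field, one simple approach is to flatten each record together with its form-level fields into a row. The sketch below writes CSV, which spreadsheet tools such as Excel can open; choosing CSV, the helper `to_rows`, and the selection of columns are assumptions made for illustration and are not specified by the patent.

```python
import csv
import io

FIELDS = ["title", "from", "consignId", "inAmount"]


def to_rows(tables):
    """Flatten each record into one row that repeats its form-level fields."""
    rows = []
    for table in tables:
        for item in table["item"]:
            row = {"title": table["title"], "from": table["from"]}
            row.update(item)
            rows.append(row)
    return rows


# Hypothetical extraction result using sample values from the description.
tables = [{
    "title": "Purchase Order",
    "from": "H5S00001",
    "item": [{"consignId": "HaA-012345", "inAmount": "2,400.00"}],
}]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
writer.writeheader()
writer.writerows(to_rows(tables))
csv_text = buffer.getvalue()
```

Note that the quantity "2,400.00" contains a comma, so the csv module quotes that field automatically.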

With the above method, the structure of many types of text, including texts with complex formats and large data volumes, can be described by a text description definition document, and the data in the text can be extracted according to that document.

Although the text parsing system and method of the present invention have been disclosed above by way of preferred embodiments, these embodiments are not intended to limit the invention. Anyone skilled in the art may make changes and refinements without departing from the spirit and scope of the invention, and the scope of protection of the invention is therefore defined by the appended claims.

Claims (6)

1. A text parsing system, comprising: a loading module for loading a text description definition document and for loading the text to be parsed into an array, each element of the array corresponding to one line of the text, wherein the text description definition document defines the matching rules of each block and each data item of the text, the blocks forming a multi-level tree structure whose top level is a root block and whose bottom level consists of minimum unit blocks; a parsing module for matching, starting from the root block, the blocks of the text in the array according to the matching rule of each block, wherein if a matched block contains sub-blocks, the sub-blocks are matched within the matched block according to the matching rules of the sub-blocks, until all blocks have been matched, and for extracting each data item from each minimum unit block according to the matching rule of the corresponding data item; and an output module for outputting the extracted data in a document format required by a user.

2. The text parsing system of claim 1, wherein the text description definition document is an Extensible Markup Language (XML) document that uses regular expressions to describe the matching rules of each block and each data item.

3. The text parsing system of claim 1, wherein the output module outputs the extracted data in the document format required by the user by organizing the extracted data into an XML document according to the text description definition document and outputting that XML document.

4. A text parsing method, comprising the steps of: creating a text description definition document, wherein the text description definition document defines the matching rules of each block and each data item, the blocks forming a multi-level tree structure whose top level is a root block and whose bottom level consists of minimum unit blocks; loading the text description definition document and loading the text to be parsed into an array, each element of the array corresponding to one line of the text; starting from the root block, matching the blocks of the text in the array according to the matching rule of each block, wherein if a matched block contains sub-blocks, the sub-blocks are matched within the matched block according to the matching rules of the sub-blocks, until all blocks have been matched, and extracting each data item from each minimum unit block according to the matching rule of the corresponding data item; and outputting the extracted data in a document format required by a user.

5. The text parsing method of claim 4, wherein the text description definition document is an Extensible Markup Language (XML) document that uses regular expressions to describe the matching rules of each block and each data item.

6. The text parsing method of claim 4, wherein outputting the extracted data in the document format required by the user comprises organizing the extracted data into an XML document according to the text description definition document and outputting that XML document.
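The flow recited in the claims (load the text into a line array, match blocks top-down from the root block, extract data from the minimum unit blocks, output XML) can be illustrated with a minimal sketch. It is not the patented implementation. The claims call for an XML description definition document; a nested Python dict stands in for it here to keep the example short. The keys `name`, `start`, `fields` and `children`, the sample text, and the boundary rule (a block runs from a line matching its `start` pattern up to the next line that opens a sibling block) are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical description definition (assumed structure, not from the patent).
# Blocks form a tree: the top level is the root block, the leaves are the
# minimum unit blocks that carry the data-matching rules ("fields").
DESCRIPTION = {
    "name": "report",                      # root block: spans the whole array
    "children": [{
        "name": "form",
        "start": r"^Form \d+",             # a matching line opens a "form" block
        "children": [
            {"name": "header", "start": r"^Vendor",
             "fields": {"vendor_id": r"Vendor:\s*(\S+)"}},
            {"name": "record", "start": r"^R\d+",
             "fields": {"receipt_no": r"^(R\d+)", "qty": r"qty=(\d+)"}},
        ],
    }],
}

# Made-up sample text with two forms.
SAMPLE = """Form 1
Vendor: V001
R100 qty=5
R101 qty=7
Form 2
Vendor: V002
R200 qty=3
"""


def load_lines(text):
    """Load the text into an array, one element per line."""
    return text.splitlines()


def match_block(rule, lines, lo, hi):
    """Match the block described by `rule` inside lines[lo:hi], recursively."""
    node = ET.Element(rule["name"])
    children = rule.get("children", [])
    if not children:
        # Minimum unit block: apply each data rule and keep capture group 1.
        text = "\n".join(lines[lo:hi])
        for field, pattern in rule.get("fields", {}).items():
            found = re.search(pattern, text, re.MULTILINE)
            if found:
                ET.SubElement(node, field).text = found.group(1)
        return node
    # Block with sub-blocks: a line matching any child's start pattern closes
    # the sub-block currently open and opens a new one.
    starts = [(re.compile(child["start"]), child) for child in children]
    current = None  # (child rule, index of its first line)
    for i in range(lo, hi):
        hit = next((c for rx, c in starts if rx.search(lines[i])), None)
        if hit is not None:
            if current is not None:
                node.append(match_block(current[0], lines, current[1], i))
            current = (hit, i)
    if current is not None:
        node.append(match_block(current[0], lines, current[1], hi))
    return node


def parse(text, description=DESCRIPTION):
    """Run load, match and extract, then return the result as an XML string."""
    lines = load_lines(text)
    root = match_block(description, lines, 0, len(lines))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(parse(SAMPLE))
```

In this sketch the output element names come from the description itself, which mirrors the idea in claims 3 and 6 of organizing the extracted data according to the description definition document.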
TW97104022A 2008-02-01 2008-02-01 System and method for parsing texts TWI414950B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW97104022A TWI414950B (en) 2008-02-01 2008-02-01 System and method for parsing texts

Publications (2)

Publication Number Publication Date
TW200935252A (en) 2009-08-16
TWI414950B (en) 2013-11-11

Family

ID=44866526

Country Status (1)

Country Link
TW (1) TWI414950B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020026462A1 (en) * 2000-07-13 2002-02-28 Shotton Charles T. Apparatus for and method of selectively retrieving information and enabling its subsequent display
TW565782B (en) * 2001-05-04 2003-12-11 Ibm Dedicated processor for efficient processing of documents encoded in a markup language
US6880125B2 (en) * 2002-02-21 2005-04-12 Bea Systems, Inc. System and method for XML parsing

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees