TWI629605B - Data extracting method for portable document format file corresponding to credit record of user and personal credit analysis system - Google Patents

Data extracting method for portable document format file corresponding to credit record of user and personal credit analysis system

Info

Publication number
TWI629605B
Authority
TW
Taiwan
Prior art keywords
text content
file
type
text
fields
Prior art date
Application number
TW106107965A
Other languages
Chinese (zh)
Other versions
TW201833795A (en)
Inventor
呂建林
Original Assignee
新愛世科技股份有限公司
Priority date
Filing date
Publication date
Application filed by 新愛世科技股份有限公司
Priority to TW106107965A
Application granted
Publication of TWI629605B
Publication of TW201833795A

Abstract

A data extraction method for a Portable Document Format (PDF) file corresponding to a user's credit record. The method includes: performing vector conversion on the PDF file to convert it into a vector file; dividing the multiple pieces of text content of the vector file into first-type text content and second-type text content according to the multiple layers of the PDF file; performing a cleaning operation on the vector file to convert it into a text file; and extracting multiple pieces of data from multiple fields of a target section among the multiple sections of the text file.

Description

Data extraction method for a Portable Document Format file corresponding to a user's credit record, and personal credit analysis system

The present invention relates to a data extraction method, and more particularly to a data extraction method for a Portable Document Format (PDF) file corresponding to a user's credit record.

With advances in technology and the rapid growth of personal finance, users increasingly want to look up their personal credit rating and related information efficiently over the Internet (for example, through websites or institutions that provide credit ratings). In the past, some institutions or websites asked users to apply for a paper report and mail it in, and the credit rating was then performed manually. As these procedures have moved online, a user can now download a personal credit report from the Joint Credit Information Center and, based on its content, enter the corresponding information on the website responsible for the credit rating. Alternatively, the user uploads the downloaded personal credit report directly to that website, and the website enters the information related to the credit rating according to the uploaded report. The website can then calculate the user's credit rating from that information.

However, entering credit-rating information by hand from the content of a personal credit report is time-consuming and prone to input errors. How to extract the relevant information from a personal credit report efficiently has therefore become a goal pursued by those skilled in the art.

The present invention provides a data extraction method that can efficiently extract data from a PDF file corresponding to a user's credit record.

An embodiment of the present invention provides a data extraction method applicable to a PDF file corresponding to a user's credit record. The method includes: performing vector conversion on the PDF file to convert it into a vector file, wherein the PDF file has multiple pieces of text content respectively arranged in multiple layers, the pieces of text content of the vector file are the pieces of text content arranged in the layers of the PDF file, and all of the text content of the vector file is selectable; dividing the text content of the vector file into first-type text content and second-type text content according to the layers of the PDF file; performing a cleaning operation on the vector file to convert it into a text file; and extracting multiple pieces of data from multiple fields of a target section among the multiple sections of the text file that belong to the first-type text content.

Based on the above, the data extraction method provided by an embodiment of the present invention converts the PDF file corresponding to a user's credit record into a vector file and then performs a cleaning operation on the vector file to obtain a text file containing only specific content, so that the data recorded in multiple fields can be extracted efficiently from a target section.

To make the above features and advantages of the present invention easier to understand, embodiments are described in detail below with reference to the accompanying drawings.

100‧‧‧personal credit analysis system

110‧‧‧vector conversion module

120‧‧‧cleaning module

130‧‧‧data extraction module

140‧‧‧user profile database

150‧‧‧credit rating module

200‧‧‧personal credit report PDF file

S201, S203, S205, S207‧‧‧steps of the data extraction method

S2031, S2033‧‧‧sub-steps of step S203

S2051, S2053, S2055, S2057‧‧‧sub-steps of step S205

S2071, S2073, S2075‧‧‧sub-steps of step S207

S20731, S20733, S20735, S20737, S20739‧‧‧sub-steps of step S2073

701‧‧‧text content of the header layer

702, 703‧‧‧text content of the watermark layer

801‧‧‧target section

810‧‧‧name of the target section; first line of text content of the target section

820‧‧‧second line of text content of the target section

821‧‧‧third line of text content of the target section

822‧‧‧fourth line of text content of the target section

FIG. 1 is a block diagram of a personal credit analysis system according to an embodiment of the invention.

FIG. 2 is a flowchart of a data extraction method according to an embodiment of the invention.

FIG. 3 is a flowchart of step S203 of FIG. 2 according to an embodiment of the invention.

FIG. 4 is a flowchart of step S205 of FIG. 2 according to an embodiment of the invention.

FIG. 5 is a flowchart of step S207 of FIG. 2 according to an embodiment of the invention.

FIG. 6 is a flowchart of step S2073 of FIG. 5 according to an embodiment of the invention.

FIG. 7 shows part of a personal credit report according to an embodiment of the invention.

FIG. 8 shows the content of one section of the text file obtained by cleaning the vector file converted from a personal credit report, according to an embodiment of the invention.

For ease of explanation, assume that a user uploads a file related to the user's credit record to a website that provides personal credit ratings. The file is, for example, a personal credit report file that the user downloaded from the Joint Credit Information Center (財團法人金融聯合徵信中心) or from another institution that can provide the user's credit record, and the file is in Portable Document Format (PDF). The website receives the file through a personal credit analysis system running on the website's server and extracts from the received file the data that can be used to determine the user's credit rating.

FIG. 1 is a block diagram of a personal credit analysis system according to an embodiment of the invention.

Referring to FIG. 1, in this embodiment the personal credit analysis system 100 includes a vector conversion module 110, a cleaning module 120, a data extraction module 130, a user profile database 140, and a credit rating module 150. In addition, an input module (not shown) of the personal credit analysis system 100 receives a personal credit report Portable Document Format file 200 (hereinafter also called the personal credit report PDF file 200) from the user.

In this embodiment, the personal credit analysis system 100 is an application made up of several program code modules: the vector conversion module 110, the cleaning module 120, the data extraction module 130, the user profile database 140, and the credit rating module 150. A processing unit of the server of the website that provides personal credit ratings can access and execute the personal credit analysis system 100. In other embodiments, however, the personal credit analysis system 100 can also be implemented as an application (app) that is installed and executed on a mobile device. Alternatively, the program code modules of the personal credit analysis system 100 can be implemented in hardware. For example, the vector conversion module 110 can be implemented as a vector conversion circuit unit that has the functions of the vector conversion module 110. Likewise, the cleaning module 120, the data extraction module 130, and the credit rating module 150 can be implemented as a cleaning circuit unit, a data extraction circuit unit, and a credit rating circuit unit, respectively, and the user profile database 140 can be stored in a storage circuit unit coupled to the data extraction circuit unit and the credit rating circuit unit.

The functions of the modules of the personal credit analysis system and the data extraction method provided by this embodiment are described below with reference to both FIG. 1 and FIG. 2.

FIG. 2 is a flowchart of a data extraction method according to an embodiment of the invention. Referring to both FIG. 1 and FIG. 2, in step S201 the vector conversion module 110 performs vector conversion on the PDF file to convert it into a vector file. The vector file is, for example, a Scalable Vector Graphics (SVG) file. Note that the invention is not limited to any particular method or program for performing the vector conversion.

More specifically, when the vector conversion module 110 receives the personal credit report PDF file 200 uploaded by the user, it converts the file into a vector file. In general, the personal credit report PDF file 200 has content (for example, image content and/or text content) arranged in multiple layers. The layers are, for example, a body layer, a watermark layer, a header layer, and a footer layer. The body layer has multiple sections, each of which records one kind of information from the user's personal credit report and has multiple lines of text content. The watermark layer records specific patterns or content, for example an anti-counterfeiting pattern (such as pattern 703 in FIG. 7), a code identifying the user (such as code 702 in FIG. 7), or a document number. The content of the watermark layer mostly overlaps the text content of the body layer. The header layer (such as header layer 701 in FIG. 7) and the footer layer record auxiliary information in the personal credit report PDF file 200 that does not affect the determination of the user's credit rating, such as page numbers, the file creation date and time, and the name of the institution that provided the credit record.

The lines of text content of the converted vector file are the lines of text content arranged in the layers of the personal credit report PDF file 200. All lines of text content of the vector file are selectable.

The vector conversion module 110 sends the converted vector file to the cleaning module 120.

In step S203, the cleaning module 120 divides the lines of text content of the vector file into first-type text content and second-type text content according to the layers of the PDF file. The division is described in detail below with reference to FIG. 3.

FIG. 3 is a flowchart of step S203 of FIG. 2 according to an embodiment of the invention. Referring to FIG. 3, in step S2031 the cleaning module 120 classifies the text content of the vector file that corresponds to the watermark layer, the header layer, and the footer layer as second-type text content. Next, in step S2033, the cleaning module 120 classifies all other text content of the vector file, that is, everything that is not second-type text content, as first-type text content.

Specifically, the cleaning module 120 first classifies the text content of the vector file that is not used to determine the user's credit rating (for example, text or image content in the watermark, header, and footer layers) as second-type text content, and classifies the text content that is used to determine the credit rating (for example, the text content of the body layer) as first-type text content. The cleaning module 120 then filters the first-type text content further to extract the data that can be used to determine the user's credit rating.
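The division in steps S2031 and S2033 can be sketched as follows. This is only an illustration, assuming each line of the vector file has already been tagged with the layer it came from; the `TextLine` type and the layer labels are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass

# Hypothetical layer labels; the source names the layers but not how they are encoded.
SECOND_TYPE_LAYERS = {"watermark", "header", "footer"}


@dataclass
class TextLine:
    layer: str  # layer of the PDF file the line was arranged in
    text: str   # text content of the line


def classify(lines):
    """Return (first_type, second_type) lists of lines."""
    # S2031: watermark, header, and footer text becomes second-type content.
    second_type = [ln for ln in lines if ln.layer in SECOND_TYPE_LAYERS]
    # S2033: everything else becomes first-type content.
    first_type = [ln for ln in lines if ln.layer not in SECOND_TYPE_LAYERS]
    return first_type, second_type
```

Only the first-type list would be passed on to the later cleaning and extraction steps.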

For example, returning to FIG. 2, in step S205 the cleaning module 120 performs a cleaning operation on the vector file to convert it into a text file. The cleaning operation is described in more detail below with reference to FIG. 4.

FIG. 4 is a flowchart of step S205 of FIG. 2 according to an embodiment of the invention. Referring to FIG. 4, in step S2051 the cleaning module 120 stores some or all of the lines of text content of the vector file in a single layer of the text file. In step S2053, the cleaning module 120 identifies multiple sections among the lines of the text file that belong to the first-type text content. In more detail, the cleaning module 120 performs the subsequent cleaning operations only on the first-type text content. For example, assume the first-type text content is all of the text content of the body layer of the personal credit report PDF file 200. The cleaning module 120 identifies the sections contained in the first-type text content, distinguishes each section according to the horizontal separator lines between sections (for example, a double solid line, a dashed line, or a single solid line), and identifies the name of each section. For example, the cleaning module 120 can use the specific characters "【" and "】" in the first line of each section to identify its name. From the text "【信用卡資訊】" ("Credit Card Information") recorded in one line of a section of the body layer of the personal credit report PDF file 200, the section name can be identified as "信用卡資訊".
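A minimal sketch of the section identification in step S2053 follows. It assumes the separator lines survive in the text as runs of "=", "-", or "_" characters, which the patent does not specify; the function names are illustrative.

```python
import re

# Section names are enclosed in the characters 【 and 】, as described in the text.
SECTION_NAME = re.compile(r"【(.+?)】")
# Assumption: a separator line is a run of '=', '-' or '_' characters.
SEPARATOR = re.compile(r"^[=\-_]{3,}$")


def split_sections(lines):
    """Group lines into sections, using separator lines as boundaries."""
    sections, current = [], []
    for line in lines:
        text = line.strip()
        if SEPARATOR.match(text):
            if current:
                sections.append(current)
                current = []
        elif text:
            current.append(text)
    if current:
        sections.append(current)
    return sections


def section_name(section):
    """Return the name found in the first line of a section, or None."""
    match = SECTION_NAME.search(section[0]) if section else None
    return match.group(1) if match else None
```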

Next, in step S2055, the cleaning module 120 divides the sections of the first-type text content, according to their types, into first-type sections, which do not have multiple fields, and second-type sections, which have multiple fields. Both first-type and second-type sections are used to determine the user's credit rating. Specifically, the cleaning module 120 stores a predetermined rule table that records every section name that may appear in the personal credit report PDF file 200, the total number of fields for each section name, and the character type of each field for each section name. Note that the predetermined rule table is defined according to the format of the personal credit report PDF file 200.
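One possible shape for the predetermined rule table is sketched below. The dictionary layout and the character-type labels are assumptions made for illustration; the patent only states what kinds of information the table records. The entry shown uses the "信用卡資訊" section discussed in this description.

```python
# Hypothetical rule table keyed by section name.
RULE_TABLE = {
    "信用卡資訊": {
        "field_count": 6,
        # Character type of each field (labels are illustrative).
        "char_types": {
            "發卡機構": "chinese",
            "卡名": "chinese|english|space",
            "額度": "digit",
            "發卡日期": "digit|slash",
            "停用日期": "digit|slash",
            "使用狀態": "chinese",
        },
    },
}


def field_count(name):
    """Look up the total number of fields for a section name, or None if unknown."""
    rule = RULE_TABLE.get(name)
    return None if rule is None else rule["field_count"]
```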

In another embodiment, the cleaning module 120 can identify the field names, the total number of fields, and the character types for each section name directly from the line of text that follows the line recording the section name.

Next, in step S2057, the cleaning module 120 marks, according to the section names of the first-type and second-type sections, part of the lines of text content of those sections; the lines of those sections that are left unmarked are the horizontal separator lines. In other words, in step S2057 the cleaning module 120 marks in the text file only the first-type and second-type sections of the first-type text content that are used to determine the user's credit rating, and does not mark the separator lines (or any other text content that is not used to determine the credit rating). All marked text content is selected later so that the corresponding data can be extracted. Put differently, by leaving content unmarked, the cleaning operation ensures that the text content of the watermark, header, and footer layers, and other text content not used to determine the credit rating (such as separator lines), is never extracted.
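The marking in step S2057 might look like the sketch below, which tags each kept line with the name of its section and leaves separator lines unmarked. The separator pattern and the `known_names` set (standing in for the section names used for the credit rating) are assumptions.

```python
import re

SECTION_NAME = re.compile(r"【(.+?)】")
# Assumption: a separator line is a run of '=', '-' or '_' characters.
SEPARATOR = re.compile(r"^[=\-_]{3,}$")


def mark_lines(lines, known_names):
    """Return (section_name, text) pairs for the lines that should be marked."""
    marked = []
    current = None
    for line in lines:
        text = line.strip()
        if not text or SEPARATOR.match(text):
            continue  # separator lines stay unmarked and are never extracted
        match = SECTION_NAME.search(text)
        if match:
            # A new section starts; keep it only if its name is a known one.
            current = match.group(1) if match.group(1) in known_names else None
        if current is not None:
            marked.append((current, text))
    return marked
```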

In this embodiment, the lines of text content of the text file can be arranged in order according to the layers of the personal credit report PDF file 200. For example, the text file records, in order, the lines of text content of the header layer, footer layer, watermark layer, and body layer of the personal credit report PDF file 200, and all lines of the text file are arranged in the same layer so that they do not overlap one another. Note that the order in which the text file arranges the lines of the different layers is not limited to this embodiment. For example, in another embodiment the text file records, in order, the lines of the watermark layer, header layer, footer layer, and body layer.

Returning to FIG. 2, in step S207 the data extraction module 130 extracts the data recorded in the multiple fields of a target section among the sections of the text file that belong to the first-type text content. Step S207 is described in detail below with reference to FIG. 5.

FIG. 5 is a flowchart of step S207 of FIG. 2 according to an embodiment of the invention. Referring to FIG. 5, in step S2071 the data extraction module 130 selects the marked lines of text content of the target section from the text file. In other words, the data extraction module 130 selects text content only from the lines that were marked earlier, which avoids extracting useless information. The target section is the section from which data is currently to be extracted. Target sections can be chosen one after another in the order in which the sections are arranged, or according to specific rules associated with the credit rating module 150 or the user profile database 140; the invention is not limited in this respect.

After selecting the marked lines of the target section, the data extraction module 130 performs the subsequent extraction according to whether the target section is a first-type section or a second-type section. In other words, it uses a different extraction operation depending on whether the target section has multiple fields.

For example, in step S2073, if the target section is a second-type section, the data extraction module 130 divides each of the selected lines of the target section into multiple fields and extracts the data recorded in each field. Among the lines belonging to the same field, the text in the first line is the field name, and the text in the other lines is the data corresponding to that field name.

As another example, in step S2075, if the target section is a first-type section, the data extraction module 130 identifies the name of the target section from the selected lines, treats the text that is the section name as a section name field, treats the remaining selected text as the section content data corresponding to that field, and extracts the section content data. For example, the text "【銀行借款資訊】查資料庫中無台端105年04月底在國內各金融機構借款餘額" in FIG. 7 (a bank loan information section stating that the database holds no record of outstanding loans at domestic financial institutions as of the end of April of year 105, ROC calendar) is the text content of a target section. The data extraction module 130 determines that this target section is a first-type section and therefore does not divide its text into fields. It identifies the section name field as "【銀行借款資訊】" and extracts the corresponding section content data, "查資料庫中無台端105年04月底在國內各金融機構借款餘額".
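The first-type extraction in step S2075 can be illustrated with a short sketch, assuming the selected lines of the section have been joined into one string. The function name and the returned dictionary keys are hypothetical.

```python
import re

# The section name field keeps its enclosing 【 】 characters, as in the example above.
FIRST_TYPE = re.compile(r"\s*(【.+?】)\s*(.*)", re.S)


def extract_first_type(text):
    """Split a first-type section into its name field and its content data."""
    match = FIRST_TYPE.match(text)
    if match is None:
        raise ValueError("no section name found")
    return {"section_name": match.group(1), "content": match.group(2).strip()}
```

Applied to the FIG. 7 example, the name field would be "【銀行借款資訊】" and the content would be the remainder of the text, with no attempt to split it into fields.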

Step S2073 is described in detail below with reference to FIG. 6, FIG. 7, and FIG. 8.

FIG. 6 is a flowchart of step S2073 of FIG. 5 according to an embodiment of the invention. FIG. 7 shows part of a personal credit report according to an embodiment of the invention. FIG. 8 shows the content of one section of the text file obtained by cleaning the vector file converted from a personal credit report, according to an embodiment of the invention.

Refer to FIG. 6, FIG. 7, and FIG. 8 together. In this embodiment, assume the target section 801 among the sections of the converted text file is the section "信用卡資訊" ("Credit Card Information"). As FIG. 8 shows, target section 801 has four lines of text content: the first line 810, "【信用卡資訊】"; the second line 820, "發卡機構 卡名 額度 發卡日期 停用日期 使用狀態" (card issuer, card name, credit limit, issue date, deactivation date, usage status); the third line 821, "元大銀行 VISA 普卡 64 105/02/15 112/02/15 使用中"; and the fourth line 822, "元大銀行 VISA 白金卡 100 106/01/03 115/01/03 使用中". Within each line, the items are separated by one space.

In step S20731, the data extraction module 130 identifies the name of the target section from the first line of the selected lines of text content. As described above, the data extraction module 130 can use specific characters (for example, "【" and "】") to identify the section name in the first line of the target section.

In step S20733, the data capture module 130 either looks up the total number of fields of the target chapter in a predetermined rule table according to the name of the target chapter, or determines the total number of fields from the second line of the selected lines of text content. The second line gives the field names of the target chapter, with a separator character between adjacent field names. Specifically, as described above, the data capture module 130 identifies the name of target chapter 801 as "信用卡資訊" (credit card information) and finds in the predetermined rule table that a chapter with this name has a total of 6 fields. That is, the data capture module 130 knows that every line of the target chapter other than the first can be divided into 6 fields: "發卡機構" (card issuer), "卡名" (card name), "額度" (credit limit), "發卡日期" (issue date), "停用日期" (deactivation date), and "使用狀態" (usage status).

In another embodiment, however, the data capture module 130 does not query the predetermined rule table. Instead, it directly treats the line below the chapter name (i.e., the second line of text content) as the field names, with one separator character (e.g., a space) between adjacent names. For example, the second line of the target chapter consists of the field names "發卡機構", "卡名", "額度", "發卡日期", "停用日期", and "使用狀態", each separated from the next by one space. The data capture module 130 counts these field names and determines that the target chapter has 6 fields, i.e., that the total number of fields is 6. In one embodiment, the data capture module 130 takes the number of separator characters in the second line plus 1 as the total number of fields of the target chapter.
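The two ways of obtaining the field count in step S20733 can be sketched as below. The dictionary `RULE_TABLE` and the function `total_fields` are hypothetical names introduced for illustration; the text does not say how the predetermined rule table is stored. The fallback assumes exactly one separator between adjacent field names, as the text describes.

```python
# Hypothetical rule table: chapter name -> total number of fields.
# Only the "信用卡資訊" entry comes from the example in the text.
RULE_TABLE = {"信用卡資訊": 6}


def total_fields(name: str, header_line: str, separator: str = " ") -> int:
    """Return the field count from the rule table, else derive it from the header."""
    if name in RULE_TABLE:
        return RULE_TABLE[name]
    # Fallback embodiment: number of separator characters plus one.
    return header_line.strip().count(separator) + 1
```

Counting separators is only reliable when field names themselves never contain the separator, which is why the rule-table lookup is tried first in this sketch.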

In step S20735, the data capture module 130 identifies the character type of each field according to its field name. For example, the data capture module 130 determines from the predetermined rule table that data in the "發卡機構" (card issuer) field consists of Chinese characters; data in the "卡名" (card name) field consists of Chinese characters, English letters, or spaces; data in the "額度" (credit limit) field consists of digits; data in the "發卡日期" (issue date) field consists of digits or slashes; data in the "停用日期" (deactivation date) field consists of digits or slashes; and data in the "使用狀態" (usage status) field consists of Chinese characters.
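One way to express the character types of step S20735 is a table of regular expressions keyed by field name. This is a sketch under assumptions: the name `CHAR_TYPES`, the helper `matches_type`, and the choice of the CJK Unified Ideographs range U+4E00 to U+9FFF for "Chinese characters" are illustrative and not taken from the patent.

```python
import re

# Assumed encoding of the character types listed in the text.
# "Chinese characters" is approximated by the range U+4E00..U+9FFF.
CHAR_TYPES = {
    "發卡機構": re.compile(r"[\u4e00-\u9fff]+"),          # Chinese only
    "卡名": re.compile(r"[\u4e00-\u9fffA-Za-z ]+"),       # Chinese, English, spaces
    "額度": re.compile(r"[0-9]+"),                        # digits
    "發卡日期": re.compile(r"[0-9/]+"),                   # digits or slashes
    "停用日期": re.compile(r"[0-9/]+"),                   # digits or slashes
    "使用狀態": re.compile(r"[\u4e00-\u9fff]+"),          # Chinese only
}


def matches_type(field_name: str, value: str) -> bool:
    """True if every character of value is allowed in the named field."""
    return CHAR_TYPES[field_name].fullmatch(value) is not None
```

These patterns reject empty strings, so a deployment that allows blank fields would need to relax them (for example by replacing `+` with `*`).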

It is worth noting that, in this embodiment, some fields are allowed to hold blank data, i.e., to record nothing. Such blank data is also recorded in the user profile database.

Next, in step S20737, the data capture module 130 uses the separator character to divide each line of the selected text content, from the third line through the last line, into a plurality of fields, such that the number of fields obtained for each line equals the total number of fields of the target chapter and the data in each field conforms to the character type of the field to which it belongs.

Specifically, in this embodiment, the data capture module 130 also uses the separator character (e.g., a space) to split the remaining lines of target chapter 801 (the third line through the last line) into fields. It then checks the result against the identified total number of fields and the character type of each field, to confirm that the split fields match both the field count of the target chapter and the character type of every field.

For example, referring to FIG. 8, the data capture module 130 determines from the predetermined rule table that target chapter 801 has a plurality of fields (i.e., it is a second-type chapter). The data capture module 130 then splits into fields the text content of every line of the target chapter other than the first (e.g., the second line 820, the third line 821, and the fourth line 822). It recognizes the second line 820 as the field names, and the third line 821 and fourth line 822 as the data corresponding to those fields.

For instance, using the space as the separator character, the data capture module 130 splits the text content of the second line 820, "發卡機構 卡名 額度 發卡日期 停用日期 使用狀態", into 6 fields. It then compares the total number of fields of the target chapter with the number of fields obtained. In this example, the total number of fields of the target chapter (6) equals the number of fields obtained (6), so the data capture module 130 determines that splitting the second line 820 succeeded. The data capture module 130 further determines that the names of the 6 fields of the target chapter are "發卡機構", "卡名", "額度", "發卡日期", "停用日期", and "使用狀態", and, as described above, obtains the character type of each of these fields from the predetermined rule table.

Next, using the space as the separator character, the data capture module 130 splits the text content of the third line 821, "元大銀行 VISA 普卡 64 105/02/15 112/02/15 使用中", into 7 fields whose contents are "元大銀行", "VISA", "普卡", "64", "105/02/15", "112/02/15", and "使用中". In this example, the total number of fields of the target chapter (6) does not equal the number of fields obtained (7), so the data capture module 130 determines that splitting the third line 821 failed, and uses the character type of each field to decide how to regroup the 7 fields into 6.

For example, the third field in the second line 820 is "額度", whose character type is digits. The third field obtained from the third line 821, however, is the Chinese text "普卡". This does not conform to the character type (digits) of the field it was assigned to (the third field, "額度"), but it does conform to the character type of the second field of the second line 820 (which allows Chinese characters). In addition, the fourth field obtained from the third line 821 is the number "64", which conforms to the digit character type. Based on this comparison, the data capture module 130 tries merging the third item of the third line 821, "普卡", into the second item, "VISA", and then continues to check, against the predetermined rule table, whether the data in the subsequent fields conforms to the expected character types.

In this example, after "普卡" is merged into "VISA" to form a single field (i.e., the second field becomes "VISA普卡"), the fields of the third line 821 are "元大銀行", "VISA普卡", "64", "105/02/15", "112/02/15", and "使用中". These 6 fields match the total number of fields, and the data in each field of the third line 821 conforms to the character type of the field to which it belongs (for example, the third item, "64", conforms to the digit character type of its field, "額度"). The data capture module 130 therefore determines that splitting the third line 821 succeeded. It then goes on to split the text content of the fourth line 822 in the same way, which is not repeated here.
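The split-then-repair behavior illustrated with the third line 821 can be sketched as a single left-to-right pass. This is an assumed, simplified strategy rather than the patent's exact procedure: the names `FIELD_NAMES`, `PATTERNS`, and `split_row` are hypothetical, the patterns repeat the character types described for step S20735, a misfitting token is merged only into the field immediately before it, and blank fields are not handled.

```python
import re
from typing import List, Optional

FIELD_NAMES = ["發卡機構", "卡名", "額度", "發卡日期", "停用日期", "使用狀態"]

# Assumed per-field character types (Chinese approximated by U+4E00..U+9FFF).
PATTERNS = {
    "發卡機構": re.compile(r"[\u4e00-\u9fff]+"),
    "卡名": re.compile(r"[\u4e00-\u9fffA-Za-z ]+"),
    "額度": re.compile(r"[0-9]+"),
    "發卡日期": re.compile(r"[0-9/]+"),
    "停用日期": re.compile(r"[0-9/]+"),
    "使用狀態": re.compile(r"[\u4e00-\u9fff]+"),
}


def split_row(line: str, separator: str = " ") -> Optional[List[str]]:
    """Split a data line into len(FIELD_NAMES) fields, merging misfit tokens.

    Returns None when the line cannot be reconciled with the field count
    and character types.
    """
    fields: List[str] = []
    for token in line.split(separator):
        nxt = len(fields)  # index of the field this token would open
        if nxt < len(FIELD_NAMES) and PATTERNS[FIELD_NAMES[nxt]].fullmatch(token):
            fields.append(token)
            continue
        if fields:
            # Token does not fit the next field: try joining it to the previous one.
            merged = fields[-1] + token
            if PATTERNS[FIELD_NAMES[nxt - 1]].fullmatch(merged):
                fields[-1] = merged
                continue
        return None
    return fields if len(fields) == len(FIELD_NAMES) else None
```

Joining without the separator mirrors the merged value "VISA普卡" given in the text. A more general splitter might search over several merge choices instead of taking the first one that fits.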

After the fields of all target chapters have been split, in step S20739 the data capture module 130 extracts, from the fields obtained, the data they record. For example, the data capture module 130 reads the data recorded in the third line 821, "元大銀行", "VISA普卡", "64", "105/02/15", "112/02/15", and "使用中", which correspond respectively to the fields of the second line 820: "發卡機構", "卡名", "額度", "發卡日期", "停用日期", and "使用狀態". In addition, the data capture module 130 may record the extracted data, according to the field names to which the data belongs, in the user profile of the corresponding user in the user profile database 140, so that the credit rating module 150 can calculate the user's credit rating from the user profile in which the data is stored.
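Recording the extracted values under their field names, as in step S20739, amounts to pairing each field name with the value in the same position. The sketch below uses a plain dictionary as a stand-in for a user profile; the function `record_row` and the profile layout (chapter name mapped to a list of rows) are assumptions, since the text does not describe the storage format of the user profile database 140.

```python
from typing import Dict, List

Profile = Dict[str, List[Dict[str, str]]]


def record_row(profile: Profile, chapter: str,
               field_names: List[str], values: List[str]) -> Profile:
    """Append one row of extracted data to the profile, keyed by field name."""
    if len(field_names) != len(values):
        raise ValueError("field names and values must have the same length")
    profile.setdefault(chapter, []).append(dict(zip(field_names, values)))
    return profile
```

For the example row, `record_row({}, "信用卡資訊", names, values)` stores the value "64" under the key "額度" in the first row of the "信用卡資訊" chapter.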

It should be noted that the above method according to the present invention may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or as software or computer code downloaded over a network and stored on a non-transitory machine-readable medium. The method described herein can thus be carried out by such software on a general-purpose computer, or by a dedicated processor or programmable or dedicated hardware (such as an ASIC or FPGA). A person of ordinary skill in the art will understand that a computer, processor, microprocessor controller, or programmable hardware includes storage elements (e.g., RAM, ROM, flash memory) that can store or receive software or computer code, and that the processor or hardware implements the processing method described herein when that software or computer code is accessed and executed. It should also be noted that when a general-purpose computer accesses code for implementing the processing described herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing that processing.

In summary, the data extraction method provided by an embodiment of the present invention converts a Portable Document Format file corresponding to a user's credit record into a vector file, and then performs a cleaning operation on the vector file to obtain a text file containing only specific content, so that the data recorded in the fields of a target chapter can be extracted efficiently.

Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the invention. Any person of ordinary skill in the art may make minor changes and refinements without departing from the spirit and scope of the invention, and the scope of protection of the invention is therefore defined by the appended claims.

Claims (9)

1. A data extraction method, applicable to a Portable Document Format (PDF) file corresponding to a credit record of a user, the method comprising: performing a vector conversion on the PDF file to convert the PDF file into a vector file, wherein the PDF file has multiple lines of text content respectively arranged in a plurality of layers, the multiple lines of text content of the vector file are the multiple lines of text content arranged in the layers of the PDF file, and all of the multiple lines of text content of the vector file can be selected; dividing the multiple lines of text content of the vector file into first-type text content and second-type text content according to the layers of the PDF file; performing a cleaning operation on the vector file to convert the vector file into a text file; and extracting, from a plurality of fields of a target chapter among a plurality of chapters of the text file that belong to the first-type text content, multiple pieces of data recorded in the fields, wherein the step of performing the cleaning operation on the vector file to convert the vector file into the text file comprises: storing some or all of the multiple lines of text content of the vector file in a same layer of the text file; identifying a plurality of chapters from the lines of text content of the text file that belong to the first-type text content; dividing the chapters into first-type chapters, which do not have a plurality of fields, and second-type chapters, which have a plurality of fields; and marking, according to the names of the chapters of the first-type chapters and the second-type chapters, a portion of the lines of text content of those chapters, wherein the remaining lines of text content of those chapters, which are not marked, are horizontal separator lines.

2. The data extraction method of claim 1, wherein the vector file is a Scalable Vector Graphics (SVG) file.

3. The data extraction method of claim 1, wherein the layers of the PDF file comprise: a main body layer; a watermark layer, wherein the text content of the watermark layer and the text content of the main body layer overlap each other; a header layer; and a footer layer.

4. The data extraction method of claim 3, wherein the step of dividing the multiple lines of text content of the vector file into the first-type text content and the second-type text content according to the layers of the PDF file comprises: classifying, among the multiple lines of text content of the vector file, the text content corresponding to the watermark layer, the text content corresponding to the header layer, and the text content corresponding to the footer layer as the second-type text content; and classifying the text content of the vector file other than the second-type text content as the first-type text content.

5. The data extraction method of claim 1, wherein the step of extracting the multiple pieces of data recorded in the fields of the target chapter among the chapters of the text file that belong to the first-type text content comprises: selecting the marked lines of text content of the target chapter from the text file; if the target chapter is a first-type chapter, identifying the name of the target chapter from the selected lines of text content, identifying the text content that is the name of the target chapter as a chapter name field, identifying the remaining text content of the selected lines as chapter content data corresponding to the chapter name field, and extracting the chapter content data; and if the target chapter is a second-type chapter, dividing each line of the selected lines of text content into a plurality of fields according to the selected lines of text content of the target chapter, so as to extract from the fields the multiple pieces of data recorded in the fields, wherein the text content recorded in the first of the lines belonging to a same field is the field name of that field, and the text content recorded in the other lines belonging to that field is data corresponding to the field name of that field.

6. The data extraction method of claim 5, wherein the step of dividing each line of the selected lines of text content into the fields according to the selected lines of text content of the target chapter, so as to extract from the fields the multiple pieces of data recorded in the fields, comprises: identifying the name of the target chapter according to the first line of the selected lines of text content; querying a predetermined rule table for a total number of fields of the target chapter according to the name of the target chapter, or identifying the total number of fields according to the second line of the selected lines of text content, wherein the second line of the selected lines of text content represents a plurality of field names of the target chapter, with a separator character between adjacent field names; identifying the character type of each of the fields according to each of the field names; dividing, according to the separator character, each line from the third line through the last line of the selected lines of text content into a plurality of fields, such that the number of fields obtained for each line equals the total number of fields of the target chapter and the data of each field obtained for each line conforms to the character type of the field to which the data belongs; and extracting, from the fields obtained, the multiple pieces of data recorded in the fields.

7. The data extraction method of claim 6, wherein the separator character comprises a space.

8. The data extraction method of claim 1, wherein the PDF file related to the credit record of the user comprises a personal credit report of the user.

9. A personal credit analysis system, comprising: a vector conversion module, which receives a Portable Document Format (PDF) file of a credit record of a user and performs a vector conversion on the PDF file to convert the PDF file into a vector file, wherein the PDF file has multiple lines of text content respectively arranged in a plurality of layers, the multiple lines of text content of the vector file are the multiple lines of text content arranged in the layers of the PDF file, and all of the multiple lines of text content of the vector file can be selected; a cleaning module, which divides the multiple lines of text content of the vector file into first-type text content and second-type text content according to the layers of the PDF file, and performs a cleaning operation on the vector file to convert the vector file into a text file; a data capture module, which extracts, from a plurality of fields of a target chapter among a plurality of chapters of the text file that belong to the first-type text content, multiple pieces of data recorded in the fields; a user profile database; and a credit rating module, wherein the data capture module further stores the extracted data, according to the field names of the fields to which the data respectively belongs, in a user profile corresponding to the user in the user profile database, and the credit rating module calculates a credit rating of the user according to the user profile, wherein, in the operation of performing the cleaning operation on the vector file to convert the vector file into the text file: the cleaning module stores some or all of the multiple lines of text content of the vector file in a same layer of the text file; the cleaning module identifies a plurality of chapters from the lines of text content of the text file that belong to the first-type text content; the cleaning module divides the chapters into first-type chapters, which do not have a plurality of fields, and second-type chapters, which have a plurality of fields; and the cleaning module marks, according to the names of the chapters of the first-type chapters and the second-type chapters, a portion of the lines of text content of those chapters, wherein the remaining lines of text content of those chapters, which are not marked, are horizontal separator lines.
TW106107965A 2017-03-10 2017-03-10 Data extracting method for portable document format file corresponding to credit record of user and personal credit analysis system TWI629605B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW106107965A TWI629605B (en) 2017-03-10 2017-03-10 Data extracting method for portable document format file corresponding to credit record of user and personal credit analysis system

Publications (2)

Publication Number Publication Date
TWI629605B true TWI629605B (en) 2018-07-11
TW201833795A TW201833795A (en) 2018-09-16

Family

ID=63640671

Family Applications (1)

Application Number Title Priority Date Filing Date
TW106107965A TWI629605B (en) 2017-03-10 2017-03-10 Data extracting method for portable document format file corresponding to credit record of user and personal credit analysis system

Country Status (1)

Country Link
TW (1) TWI629605B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW201229787A (en) * 2010-09-24 2012-07-16 Thales Sa System for the generation of technical documentation in electronic format
TW201413628A (en) * 2012-09-28 2014-04-01 Kun-Li Zhou Transcript parsing system
CN103902662A (en) * 2014-03-06 2014-07-02 杭州施强软件开发有限公司 Test question generating method based on browser
US20160259770A1 (en) * 2015-03-02 2016-09-08 Canon Kabushiki Kaisha Information processing system, server apparatus, control method, and storage medium
CN106354700A (en) * 2016-08-11 2017-01-25 广州爱九游信息技术有限公司 Page text conversion method and system
