TWI816141B

TWI816141B - System and method for quickly searching for default lists in documents

Info

Publication number: TWI816141B
Application number: TW110122165A
Authority: TW
Inventors: 王振安; 鐘令淑; 林芊華
Original assignee: 大鐸資訊股份有限公司
Priority date: 2021-06-17
Filing date: 2021-06-17
Publication date: 2023-09-21
Also published as: TW202301145A

Abstract

The system of the present invention for quickly searching for a preset list in a document is used to find name data from a blacklist database that may be contained in a document to be searched, the document having a plurality of consecutive words to be searched. A processing module first builds the words of all name data in the blacklist database into a blacklist word list. It then reads a plurality of consecutive words to be searched from the document, finds every word in the blacklist word list that matches one of the words to be searched, and builds these words into a target word list. It then counts the words in the target word list that belong to the same name data as a hit word count and compares the hit word count with the word count of that name data. If the difference is within a hit range, the name data is determined to be hit, thereby achieving the purpose of finding unspecified name data in the document to be searched.

Description

System and method for quickly searching for default lists in documents

A search system and method, and in particular a system and method for quickly searching for a preset list in a document.

In today's globalized world, financial institutions around the world maintain a certain degree of dealings with one another. To facilitate communication, messaging systems are an important means of transmitting information between financial institutions. Apart from its standardized packet format, a message places no particular restrictions on its main text, whose style of narration and length vary. A key concern is that the main text may contain the names of persons, companies, or organizations that have poor credit records or have appeared in adverse news reports.

Financial institutions generally use a global blacklist database for screening, and build a retrieval system through which information on the persons in the blacklist database can be searched when needed. The blacklist database is generally a relational database. After a user enters a target name string into the retrieval system, the system performs an exact comparison between the target name string and the list of names in the blacklist database, looks for name field content that fully matches the target name string, and, once matching field content is found, reads and outputs the related information of that field for the user to browse.

However, such a retrieval system requires the user to enter a name to be searched and performs an exact comparison, so it can only retrieve content that fully matches the entered name. Because the retrieval system compares the input text sequentially against the information in the blacklist database, when a name that may be on the blacklist is embedded in a whole passage of unspecified text, entering the whole passage into the retrieval system is pointless: the system cannot detect or extract the part of the passage that may be a target name. When a financial institution receives a message, the main text consists of continuous sentences with no specific format, so it is difficult to determine directly where the name of a person, company, or organization appears. A general retrieval system therefore cannot be used to search such a message for blacklisted names.

In addition, not only are these retrieval systems unable to automatically extract names hidden in a whole passage of text, but the name of the same person, company, or organization may also take several forms, for example with the surname and given name swapped or with a title added. The name "王曉明" (Wang Xiaoming), for instance, may appear in a message as "曉明，王" (Xiaoming, Wang) or "王先生曉明" (Mr. Wang Xiaoming). This further increases the difficulty of finding, in an arbitrary message, the words that may form a name and then searching the blacklist database for them. Existing retrieval systems therefore leave room for improvement.

In view of the fact that existing blacklist database retrieval systems cannot efficiently search an entire document against a list to find target names, the present invention provides a method and a system for quickly searching for a preset list in a document. The method includes the following steps:
reading a blacklist database, the blacklist database containing a plurality of record numbers, a plurality of name data sorted by record number, and a plurality of word count data, wherein each name data contains a plurality of words and each word count data records the number of words in the corresponding name data;
creating a blacklist word list based on the blacklist database, the blacklist word list containing every word of each name data together with the record number of the name data to which the word belongs;
receiving a document to be searched, and reading a group of consecutive words to be searched from the document;
comparing the group of consecutive words to be searched against the blacklist word list, and building at least one word that is identical to any word to be searched, together with the at least one record number corresponding to that word, into a target word list;
counting the words in the target word list that correspond to the same record number, and recording that number as at least one hit word count corresponding to the at least one record number;
comparing the at least one hit word count with the word count data of the corresponding record number, and determining whether the difference between the hit word count and the word count data is within a hit range;
if the difference between a hit word count and the corresponding word count data is within the hit range, determining the name data to be hit name data; if not, determining that the name data corresponding to the record number is not hit name data; and
completing the comparison of the group of words to be searched.

In addition, the present invention provides a system for quickly searching for a preset list in a document, comprising:
a processing module connected to the blacklist database; and
a storage module connected to the processing module;
wherein the processing module creates a blacklist word list based on the blacklist database and stores the blacklist word list in the storage module, the blacklist word list containing every word of each name data together with the record number of the name data to which the word belongs;
the processing module receives a document to be searched, stores it in the storage module, and reads a group of consecutive words to be searched from the document;
the processing module compares the group of consecutive words to be searched against the blacklist word list, builds at least one word that is identical to any word to be searched, together with the at least one record number corresponding to that word, into a target word list, and stores the target word list in the storage module;
the processing module counts the words in the target word list that correspond to the same record number, and records that number as at least one hit word count corresponding to the at least one record number;
the processing module compares the at least one hit word count with the word count data of the corresponding record number, and determines whether the difference between the hit word count and the word count data is within a hit range;
if the difference between a hit word count and the corresponding word count data is within the hit range, the processing module determines the name data to be hit name data; if not, the processing module determines that the name data corresponding to the record number is not hit name data; and
the comparison of the group of words to be searched is completed.

The method of the present invention first builds all words of all name data in the blacklist database into a separate blacklist word list and records, for each word, the number of words contained in the name data to which it belongs. When a document to be searched is received, a group of consecutive words to be searched is read from it, and the blacklist word list is searched for each of these words. Every identical word, including words that are repeated in the blacklist word list under different record numbers, is placed in a target word list. The method then counts the words in the target word list that correspond to the same record number, that is, the words that belong to the same name data, and records these hit word counts in the target word list against each record number. Finally, for each record number in the target word list, the hit word count is compared with the word count data in the blacklist database. If the difference between the two is within a hit range, several of the words to be searched are identical to the words of one name data, so the group of words to be searched very probably contains that name data, and the name data is determined to be hit name data.

For example, suppose a name in the blacklist database is "王曉明" (Wang Xiaoming), which has 3 words (characters), so its word count data is "3", and the words to be searched read from the document are "王先生曉明" (Mr. Wang Xiaoming). The search described above finds a hit word count of 3 among the words to be searched. The difference between the hit word count and the word count data is 0, so the name data "王曉明" is determined to be hit name data.
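The arithmetic of this example can be traced with a minimal Python sketch. It treats each Chinese character as one word, as the example does. The function name `hit_count` and the hit range value of 1 are illustrative assumptions and do not come from the patent text.

```python
def hit_count(name_words, window_words):
    """Count the words of one name that also occur among the words to be searched."""
    window = set(window_words)
    return sum(1 for word in name_words if word in window)

name = list("王曉明")          # name data in the blacklist, word count data = 3
window = list("王先生曉明")    # consecutive words read from the document

hits = hit_count(name, window)
difference = len(name) - hits
HIT_RANGE = 1                  # assumed preset value: a difference below 1 is a hit
is_hit = difference < HIT_RANGE

print(hits, difference, is_hit)
```

Because the count ignores word order and any extra words in between, the inserted title "先生" does not prevent the hit.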

The method and system of the present invention decompose all name data in the blacklist database into a blacklist word list, compare each of the words to be searched against it, and use the hit word count to judge whether the words to be searched may contain a given name data. Because this search method places no restriction on the order of the words to be searched or of the words of the name data in the blacklist database, a name can be found regardless of whether its words appear in the same order as in the name data and regardless of whether redundant words are inserted among them. This solves the problem that conventional blacklist database retrieval systems cannot find the name data of a preset list within a whole document.

The technical means adopted by the present invention to achieve its intended purpose are further described below with reference to the drawings and the embodiments of the present invention.

Referring to FIG. 1 and FIG. 2, the system of the present invention for quickly searching for a preset list in a document includes a processing module 10 and a storage module 20. The processing module 10 is connected to a blacklist database 30, receives a document to be searched, and executes the method of the present invention. The storage module 20 is connected to the processing module 10 and stores or temporarily stores the document to be searched, the blacklist word list, and the target word list. The processing module 10 is, for example, the main processing element of an electronic computing device such as a server or a personal computer. The storage module 20 is a storage device, such as a conventional hard disk drive (HDD) or a solid-state drive (SSD), and is preferably local to the processing module 10. The blacklist database 30 is, for example, hosted on a cloud server so that the managing organization can update it at any time. The processing module 10 connects to and reads the blacklist database 30 over the Internet, builds the blacklist word list from it, and stores the list in the local storage module 20.

The document to be searched is, for example, a message that a bank or a financial supervisory body receives over the Internet from another related organization, containing text of indefinite length and no specific format. When the processing module 10 receives the document to be searched, it temporarily stores the document in the storage module 20.

The method of the present invention for quickly searching for a preset list in a document includes the following steps:
The processing module 10 reads a blacklist database 30, which contains a plurality of record numbers, a plurality of name data sorted by record number, and a plurality of word count data (S101). Each name data contains a plurality of words, and each word count data records the number of words in the corresponding name data.
The processing module 10 creates a blacklist word list based on the blacklist database 30 (S102). The blacklist word list contains every word of each name data together with the record number of the name data to which the word belongs.
The processing module 10 receives a document to be searched and reads a group of consecutive words to be searched from it (S103). In other words, the document to be searched contains a plurality of consecutive original words, and the group of consecutive words to be searched is a word set made up of some of these consecutive original words. In an embodiment of the present invention, the processing module 10 reads the group from the document according to a single-comparison word count, which is a preset value, so the number of words in the group equals the single-comparison word count.
The processing module 10 compares each word to be searched against the blacklist word list, and builds at least one word that is identical to any word to be searched, together with the at least one record number corresponding to that word, into a target word list (S104). The target word list is stored in the storage module 20.
The processing module 10 counts the words in the target word list that correspond to the same record number, and records that number as at least one hit word count corresponding to the at least one record number (S105).
The processing module 10 compares the at least one hit word count with the word count data of the corresponding record number, and determines whether the difference between them is within a hit range (S106). The hit range is a value preset by the user, and the difference is within the hit range when it is smaller than the value of the hit range.
If the difference between a hit word count and the corresponding word count data is within the hit range, the processing module 10 determines the name data to be hit name data (S107). If not, the processing module 10 determines that the name data corresponding to the record number is not hit name data.
The comparison of the group of words to be searched is completed (S108).
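Steps S102 and S104 to S107 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the patented implementation: the function names, the dictionary layout of the blacklist, the upper-casing of words, and deriving the word count data with `len(name.split())` instead of reading a stored field are all choices made here for brevity.

```python
from collections import Counter

def build_word_list(blacklist):
    """S102: one (word, record number) entry for every word of every name."""
    return [(word.upper(), rec)
            for rec, name in blacklist.items()
            for word in name.split()]

def match_window(blacklist, word_list, window, hit_range):
    """S104-S107: return the record numbers judged to be hits for one group of words."""
    wanted = {w.upper() for w in window}
    # S104: target word list = every blacklist entry equal to any word to be searched
    target = [(word, rec) for word, rec in word_list if word in wanted]
    # S105: hit word count per record number
    hit_counts = Counter(rec for _, rec in target)
    # S106-S107: a record is a hit when (word count data - hit word count) < hit_range
    hits = []
    for rec, n_hit in hit_counts.items():
        word_count = len(blacklist[rec].split())  # assumed stand-in for the word count data
        if word_count - n_hit < hit_range:
            hits.append(rec)
    return sorted(hits)

blacklist = {
    "R3": "Giad Heavy Industries Complex",
    "R16": "Heavy Mechanical Complex",
    "R17": "Giad Metal Industries",
}
word_list = build_word_list(blacklist)
window = "INDUSTRIES GIAD HEAVY COMPLEX LTD".split()  # hypothetical group of 5 words

print(match_window(blacklist, word_list, window, hit_range=1))
```

With a hit range of 1 only a difference of 0 qualifies, so only R3 is reported even though the words appear in a different order. A larger hit range also admits names with one word missing, such as R16 and R17 here.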

Referring to FIG. 3, in an embodiment of the present invention the words in the blacklist word list are sorted by word value, so that identical words are placed next to one another. The step of creating the blacklist word list (S102) can be carried out with the following sub-steps:
The processing module 10 reads every word of each name data in order of the record numbers in the blacklist database 30 and temporarily stores the words as a full word list (S1021). The words in the full word list are arranged by record number and by their order within each name data.
The processing module 10 rearranges the words in the full word list by word value and stores the result as the blacklist word list (S1022).
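Sub-steps S1021 and S1022 can be sketched as below. The function names and the three-record sample are illustrative assumptions. Upper-casing follows the appearance of the sample word list in the example and is not stated as a step in the text, and Python's stable `sorted` is used so that equal words keep their record order.

```python
def build_full_word_list(records):
    """S1021: read every word of every name, in record-number order."""
    full = []
    for record_number, name in records:   # records are assumed to arrive in record order
        for word in name.split():
            full.append((word.upper(), record_number))
    return full

def sort_by_word_value(full_word_list):
    """S1022: sort by word value only; the sort is stable, so ties keep record order."""
    return sorted(full_word_list, key=lambda entry: entry[0])

records = [
    ("R3", "Giad Heavy Industries Complex"),
    ("R6", "Khartoum Industrial Complex Giad"),
    ("R16", "Heavy Mechanical Complex"),
]
blacklist_word_list = sort_by_word_value(build_full_word_list(records))

for word, record_number in blacklist_word_list:
    print(word, record_number)
```

Sorting on the word alone, rather than on the whole tuple, avoids ordering "R16" before "R3" as strings.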

Because the blacklist word list of the present invention is already stored in the storage module 20, the processing module 10 can read it directly from the local storage module 20 when comparing the words to be searched with the words in the list, without accessing the remote blacklist database 30 over the Internet. This increases the speed at which the processing module 10 performs the search and comparison. In addition, the processing module 10 compares each word to be searched one-to-one against the words in the blacklist word list, and the list is already ordered by word value, so once the processing module 10 finds a matching word it can find the other matching words at nearby storage addresses in the storage module 20, further improving efficiency. Compared with the complex cross-comparison of an input name string against the name strings in the blacklist database, the present invention therefore has a clear processing speed advantage when finding possible target names in a whole document.
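One way to exploit the sorted order is a binary search that locates the contiguous run of equal words. The text only says that matching words lie at nearby storage addresses, so the use of Python's `bisect` module here is an assumption about how such a lookup might be done, and `lookup` is a hypothetical helper name.

```python
from bisect import bisect_left, bisect_right

def lookup(word_list, words_only, word):
    """Return the record numbers of all entries equal to `word`.

    `word_list` is sorted by word; `words_only` is the parallel list of just the words.
    """
    lo = bisect_left(words_only, word)
    hi = bisect_right(words_only, word)
    return [rec for _, rec in word_list[lo:hi]]  # equal words are adjacent

word_list = [
    ("COMPLEX", "R3"), ("COMPLEX", "R6"), ("COMPLEX", "R16"),
    ("GIAD", "R3"), ("GIAD", "R6"),
    ("HEAVY", "R3"), ("HEAVY", "R16"),
]
words_only = [word for word, _ in word_list]

print(lookup(word_list, words_only, "GIAD"))
print(lookup(word_list, words_only, "METAL"))
```

A hash-based index would serve the same purpose; the sorted list is shown because it matches the layout described in this embodiment.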

In an embodiment of the present invention, when the processing module 10 receives a document to be searched, it first creates a search list and stores it in the storage module 20. The search list contains a plurality of groups of consecutive words to be searched taken from the document, and together these groups cover all the words to be searched in the document. After creating the search list, the processing module 10 performs the comparison of steps S104 to S107 on the first group of words to be searched. When that comparison is complete, it moves to the next group in the search list, and continues until every group in the search list has been compared.

Referring to FIG. 4, the search list is preferably created with the following sub-steps:
Starting from the first word to be searched in the document, the processing module 10 reads a number of consecutive words to be searched equal to the single-comparison word count and stores them in the search list (S401).
Starting from the second of the words to be searched read in the previous step, the processing module 10 reads the same number of consecutive words to be searched and stores them in the search list (S402).
The processing module 10 repeats the previous step until the last of the words read is the final word to be searched in the document (S403).
The creation of the search list is then complete.

That is, the search list is created by starting at the first word, reading and storing a number of words equal to the single-comparison word count, shifting by one word, reading and storing the same number of words starting from the second word of the document, and shifting and reading again until the final word of the whole document has been read and stored. The search list thus contains every group of consecutive words to be searched from the first word to the last word of the document. The single-comparison word count determines how many words to be searched the processing module 10 compares in one hit comparison. It can be set according to the average or the maximum word count of the name data in the blacklist database 30, or by rule of thumb. For example, if the single-comparison word count is 5, each group of words to be searched contains 5 consecutive words to be searched.
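The sliding construction of S401 to S403 can be sketched as follows. The function name `build_search_list` is illustrative. The handling of a document shorter than the single-comparison word count (one short group) and of an empty document (no groups) is an assumption, since the text does not cover those cases.

```python
def build_search_list(words, window_size):
    """S401-S403: groups of `window_size` consecutive words, shifted one word at a time."""
    if not words:
        return []
    if len(words) <= window_size:
        return [list(words)]          # assumed behaviour for short documents
    last_start = len(words) - window_size
    return [words[i:i + window_size] for i in range(last_start + 1)]

words = ("REGARDING OUR ACKNOWLEDGEMENT CONCERNING "
         "GIAD HEAVY INDUSTRIES COMPLEX").split()
groups = build_search_list(words, window_size=5)

for group in groups:
    print(" ".join(group))
```

Eight words with a single-comparison word count of 5 give four groups, the last of which ends on the final word of the text.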

When reading a number of consecutive words to be searched equal to the single-comparison word count, the processing module 10 further determines whether the words contain two consecutive identical words to be searched. If so, it ignores one of the two consecutive identical words and adds the next word to be searched to the group.

That is, when the search list is being created and a group of words to be searched contains consecutive repeated words, one of the repeated words is ignored and the next word to be searched is read instead. This avoids repeated comparison caused by duplicate words within the same group.
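A minimal sketch of reading one group while skipping consecutive duplicates is shown below. The function name `read_group` and the sample words are hypothetical, and returning a shorter group when the document ends early is an assumption.

```python
def read_group(words, start, window_size):
    """Read up to `window_size` words from `start`, skipping a word that repeats
    the word immediately before it and pulling in the next word instead."""
    group = []
    i = start
    while i < len(words) and len(group) < window_size:
        if not group or words[i] != group[-1]:
            group.append(words[i])
        i += 1
    return group

words = ["PAY", "TO", "GIAD", "GIAD", "HEAVY", "INDUSTRIES", "COMPLEX"]
print(read_group(words, start=0, window_size=5))
```

The repeated "GIAD" is dropped and "INDUSTRIES" is pulled in, so the group still holds five distinct consecutive words.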

In another embodiment of the present invention, before the blacklist word list is created, each name data is compared against a common vocabulary list containing a plurality of common terms, and the common terms contained in each name data are removed. The blacklist word list is then created from the name data with the common terms removed.

In this embodiment, the common terms are, for example, the Chinese terms "公司" (company), "有限公司" (limited company), and "財團法人" (foundation), and the English terms "COMPANY LIMITED", "COMPANY", "LIMITED", "IMPORT EXPORT CORP", "IMPORT EXPORT CORPORATION", and "IMPORT AND EXPORT CORPORATION". In the blacklist database 30 these common terms are noise that does not help identify a name. When a name data is found to contain such terms, they are removed first, so that the blacklist word list contains fewer words and the comparison is more efficient.
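Removing common terms before the word list is built might look like the sketch below. The function name `strip_common` is hypothetical, only the English terms listed above are included, and removing longer phrases before shorter ones is an assumed design choice that the text does not specify.

```python
COMMON_TERMS = [
    "COMPANY LIMITED", "COMPANY", "LIMITED",
    "IMPORT EXPORT CORP", "IMPORT EXPORT CORPORATION",
    "IMPORT AND EXPORT CORPORATION",
]

def strip_common(name, terms=COMMON_TERMS):
    """Return the words of `name` with every common term removed."""
    words = name.upper().split()
    # Assumed order: try longer phrases first so multi-word terms are removed whole
    for phrase in sorted((t.split() for t in terms), key=len, reverse=True):
        n = len(phrase)
        kept, i = [], 0
        while i < len(words):
            if words[i:i + n] == phrase:
                i += n                 # skip the whole phrase
            else:
                kept.append(words[i])
                i += 1
        words = kept
    return words

print(strip_common("Doosan Heavy Industries Construction Company Limited"))
print(strip_common("Giad Cars Heavy Trucks Company"))
```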

The method of the present invention for quickly searching for a preset list in a document is illustrated below with an example.

In this example, the record numbers, name data, and word count data contained in the blacklist database 30 are shown in Table 1 below.

Table 1
Record number  Name data  Word count data
R1  Doosan Heavy Industries Construction Company Limited  C6
R2  Korea Heavy Industries Construction Company Limited  C6
R3  Giad Heavy Industries Complex  C4
R4  Esfahan Chemical Industries  C3
R5  Canadian Spooner Industries Corporation  C4
R6  Khartoum Industrial Complex Giad  C4
R7  Hadid Industrial Complex  C3
R8  Shohadayeh Hadid Industries  C3
R9  Nuclear Fuel Complex  C3
R10  Kim Chaek Iron And Steel Complex  C6
R11  Namhung Chemical Union Complex  C4
R12  Giad Cars Heavy Trucks Company  C5
R13  Heavy Electrical Complex Private Limited  C5
R14  Danbel Industries Incorporated  C3
R15  Pakistan Aeronautical Complex  C3
R16  Heavy Mechanical Complex  C3
R17  Giad Metal Industries  C3
R18  Power Construction Complex of Unified Energy System of Russia Joint Stock Company  C12
R19  Heavy Water Board  C3
R20  Bharat Heavy Electricals Limited  C4
R21  Heavy Vehicles Design and Engeeniring Private Joint Stock Company  C9
R22  Iran Shipbuilding and Offshore Industries Complex Company  C7
R23  Oil Industries Management Services Private Joint Stock Company  C8
R24  Moscow Design Industrial Complex Universal Federal State Unitary Enterprise  C9
R25  Oil Industries Engineering Construction Public Joint Stock Company  C8
R26  Farasakht Industries  C2
R27  Iran Aircraft Manufacturing Industries  C4
R28  Sairan Telecommuncation Industries Private Joint Stock Company  C7
R29  Shiraz Electronics Industries  C3
R30  Thong Guan Industries Berhad  C4

Here C1 denotes a count of 1, C2 a count of 2, and so on, with Cn denoting a count of n. Word count data of C1 therefore means that the name data contains 1 word, C2 means that it contains 2 words, and in general Cn means that the name data contains n words.

The blacklist word list established according to step S102 and its sub-steps is shown in Table 2 below:

Word               Record number
AERONAUTICAL       R15
AIRCRAFT           R27
AND                R10
AND                R21
AND                R22
BERHAD             R30
BHARAT             R20
BOARD              R19
CANADIAN           R5
CARS               R12
CHAEK              R10
CHEMICAL           R4
CHEMICAL           R11
COMPANY            R1
COMPANY            R2
COMPANY            R12
COMPANY            R18
COMPANY            R21
COMPANY            R22
COMPANY            R23
COMPANY            R25
COMPANY            R28
COMPLEX            R3
COMPLEX            R6
COMPLEX            R7
COMPLEX            R9
COMPLEX            R10
COMPLEX            R11
COMPLEX            R13
COMPLEX            R15
COMPLEX            R16
COMPLEX            R18
COMPLEX            R22
COMPLEX            R24
CONSTRUCTION       R1
CONSTRUCTION       R2
CONSTRUCTION       R18
CONSTRUCTION       R25
CORPORATION        R5
DANBEL             R14
DESIGN             R21
DESIGN             R24
DOOSAN             R1
ELECTRICAL         R13
ELECTRICALS        R20
ELECTRONICS        R29
ENERGY             R18
ENGEENIRING        R21
ENGINEERING        R25
ENTERPRISE         R24
ESFAHAN            R4
FARASAKHT          R26
FEDERAL            R24
FUEL               R9
GIAD               R3
GIAD               R6
GIAD               R12
GIAD               R17
GUAN               R30
HADID              R7
HADID              R8
HEAVY              R1
HEAVY              R2
HEAVY              R3
HEAVY              R12
HEAVY              R13
HEAVY              R16
HEAVY              R19
HEAVY              R20
HEAVY              R21
INCORPORATED       R14
INDUSTRIAL         R6
INDUSTRIAL         R7
INDUSTRIAL         R24
INDUSTRIES         R1
INDUSTRIES         R2
INDUSTRIES         R3
INDUSTRIES         R4
INDUSTRIES         R5
INDUSTRIES         R8
INDUSTRIES         R14
INDUSTRIES         R17
INDUSTRIES         R22
INDUSTRIES         R23
INDUSTRIES         R25
INDUSTRIES         R26
INDUSTRIES         R27
INDUSTRIES         R28
INDUSTRIES         R29
INDUSTRIES         R30
IRAN               R22
IRAN               R27
IRON               R10
JOINT              R18
JOINT              R21
JOINT              R23
JOINT              R25
JOINT              R28
KHARTOUM           R6
KIM                R10
KOREA              R2
LIMITED            R1
LIMITED            R2
LIMITED            R13
LIMITED            R20
MANAGEMENT         R23
MANUFACTURING      R27
MECHANICAL         R16
METAL              R17
MOSCOW             R24
NAMHUNG            R11
NUCLEAR            R9
OF                 R18
OF                 R18
OFFSHORE           R22
OIL                R23
OIL                R25
PAKISTAN           R15
POWER              R18
PRIVATE            R13
PRIVATE            R21
PRIVATE            R23
PRIVATE            R28
PUBLIC             R25
RUSSIA             R18
SAIRAN             R28
SERVICES           R23
SHIPBUILDING       R22
SHIRAZ             R29
SHOHADAYEH         R8
SPOONER            R5
STATE              R24
STEEL              R10
STOCK              R18
STOCK              R21
STOCK              R23
STOCK              R25
STOCK              R28
SYSTEM             R18
TELECOMMUNCATION   R28
THONG              R30
TRUCKS             R12
UNIFIED            R18
UNION              R11
UNITARY            R24
UNIVERSAL          R24
VEHICLES           R21
WATER              R19

Table 2
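As an illustration of how step S102 could be carried out, the following Python sketch flattens a small name database into (word, record number) pairs and sorts them by word. It is a minimal sketch, not the patented implementation: the function names are hypothetical, the three records are a subset whose word sets are read off Table 2, and only the word order of R3 is stated in the text (the order shown for R17 and R26 is assumed).

```python
# Hypothetical three-record database. Word sets are taken from Table 2;
# only the word order of R3 is given in the text, the rest is assumed.
BLACKLIST_DB = {
    "R3": "GIAD HEAVY INDUSTRIES COMPLEX",
    "R17": "GIAD INDUSTRIES METAL",
    "R26": "FARASAKHT INDUSTRIES",
}


def record_number(record_id):
    """Numeric part of a record ID, so that R3 sorts before R17."""
    return int(record_id[1:])


def build_blacklist_word_list(db):
    """Return (word, record_id) pairs sorted by word, then by record number."""
    # Sub-step 1: read every word of every name in record order (full word list).
    full_word_list = [
        (word, record_id)
        for record_id in sorted(db, key=record_number)
        for word in db[record_id].split()
    ]
    # Sub-step 2: rearrange by word value so identical words sit together.
    return sorted(full_word_list, key=lambda pair: (pair[0], record_number(pair[1])))


def build_word_counts(db):
    """Word count data (the Cn values) for each record."""
    return {record_id: len(name.split()) for record_id, name in db.items()}


blacklist_word_list = build_blacklist_word_list(BLACKLIST_DB)
word_counts = build_word_counts(BLACKLIST_DB)
```

With these three records, the pairs for GIAD and INDUSTRIES end up adjacent, in the same way identical words are grouped in Table 2.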

In this example, the content of the file to be searched is as follows: "REGARDING OUR ACKNOWLEDGEMENT CONCERNING GIAD HEAVY INDUSTRIES COMPLEX DATED DD 20200929 WE HAVE TODAY SENT A SECOND REMINDER ON YOUR BEHALF. FOR ANY FUTURE CORRESPONDENCE RELATED TO THIS CASE PLEASE QUOTE OUR ENQUIRY REFERENCE USP200928-000830. REGARDS CLIENT SERVICES"

The list to be searched established according to step S401 is shown in Table 3 below. Here the single comparison word count is set to 5 as an example, so each group of words to be searched contains 5 consecutive words to be searched. Table 3 lists a total of 32 groups of words to be searched (WL1 to WL32):

List to be searched
WL1    REGARDING,OUR,ACKNOWLEDGEMENT,CONCERNING,GIAD
WL2    OUR,ACKNOWLEDGEMENT,CONCERNING,GIAD,HEAVY
WL3    ACKNOWLEDGEMENT,CONCERNING,GIAD,HEAVY,INDUSTRIES
WL4    CONCERNING,GIAD,HEAVY,INDUSTRIES,COMPLEX
WL5    GIAD,HEAVY,INDUSTRIES,COMPLEX,DATED
WL6    HEAVY,INDUSTRIES,COMPLEX,DATED,DD
WL7    INDUSTRIES,COMPLEX,DATED,DD,WE
WL8    COMPLEX,DATED,DD,WE,HAVE
WL9    DATED,DD,WE,HAVE,TODAY
WL10   DD,WE,HAVE,TODAY,SENT
WL11   WE,HAVE,TODAY,SENT,SECOND
WL12   HAVE,TODAY,SENT,SECOND,REMINDER
WL13   TODAY,SENT,SECOND,REMINDER,ON
WL14   SENT,SECOND,REMINDER,ON,YOUR
WL15   SECOND,REMINDER,ON,YOUR,BEHALF
WL16   REMINDER,ON,YOUR,BEHALF,FOR
WL17   ON,YOUR,BEHALF,FOR,ANY
WL18   YOUR,BEHALF,FOR,ANY,FUTURE
WL19   BEHALF,FOR,ANY,FUTURE,CORRESPONDENCE
WL20   FOR,ANY,FUTURE,CORRESPONDENCE,RELATED
WL21   ANY,FUTURE,CORRESPONDENCE,RELATED,TO
WL22   FUTURE,CORRESPONDENCE,RELATED,TO,THIS
WL23   CORRESPONDENCE,RELATED,TO,THIS,CASE
WL24   RELATED,TO,THIS,CASE,PLEASE
WL25   TO,THIS,CASE,PLEASE,QUOTE
WL26   THIS,CASE,PLEASE,QUOTE,OUR
WL27   CASE,PLEASE,QUOTE,OUR,ENQUIRY
WL28   PLEASE,QUOTE,OUR,ENQUIRY,REFERENCE
WL29   QUOTE,OUR,ENQUIRY,REFERENCE,USP
WL30   OUR,ENQUIRY,REFERENCE,USP,REGARDS
WL31   ENQUIRY,REFERENCE,USP,REGARDS,CLIENT
WL32   REFERENCE,USP,REGARDS,CLIENT,SERVICES

Table 3
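The construction of the list to be searched can be sketched in Python as a sliding window over the words of the file. This is only an illustration under stated assumptions: the function names are hypothetical, and the tokenizing rule (keep runs of two or more letters, which drops "20200929", "A" and the digits of "USP200928-000830") is an assumption chosen because it reproduces the 36 words that appear in Table 3; it is not taken from the text.

```python
import re

MESSAGE = (
    "REGARDING OUR ACKNOWLEDGEMENT CONCERNING GIAD HEAVY INDUSTRIES COMPLEX "
    "DATED DD 20200929 WE HAVE TODAY SENT A SECOND REMINDER ON YOUR BEHALF. "
    "FOR ANY FUTURE CORRESPONDENCE RELATED TO THIS CASE PLEASE QUOTE OUR "
    "ENQUIRY REFERENCE USP200928-000830. REGARDS CLIENT SERVICES"
)


def tokenize(text):
    """Assumed rule: keep only runs of two or more letters, upper-cased."""
    return re.findall(r"[A-Z]{2,}", text.upper())


def build_search_list(words, window_size=5):
    """Return every group of `window_size` consecutive words, in order."""
    if not words:
        return []
    # The last group ends on the last word of the file; a file shorter than
    # the window yields a single, shorter group.
    last_start = max(len(words) - window_size, 0)
    return [words[start:start + window_size] for start in range(last_start + 1)]


search_list = build_search_list(tokenize(MESSAGE))
```

Here `search_list[0]` corresponds to WL1 and `search_list[3]` to WL4.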

In step S104, the processing module 10 compares the words to be searched in each group with the blacklist word list (Table 2). Every word to be searched that is identical to a word in the blacklist word list is entered, together with the record number corresponding to that word in the blacklist word list, into a target word list. For example, the target word list established after comparing the first group of words to be searched (WL1) is shown in Table 4 below:

Record number   Word
None            REGARDING
None            OUR
None            ACKNOWLEDGEMENT
None            CONCERNING
R3              GIAD
R6              GIAD
R12             GIAD
R17             GIAD

Table 4
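A minimal Python sketch of the comparison in step S104 follows. It assumes, purely for illustration, that the blacklist word list is held as a dictionary from word to record numbers rather than as the sorted list of Table 2; the dictionary contains only the two Table 2 words needed here, and the function name is hypothetical.

```python
# Excerpt of Table 2 as a word -> record numbers lookup (assumed representation).
WORD_INDEX = {
    "GIAD": ["R3", "R6", "R12", "R17"],
    "HEAVY": ["R1", "R2", "R3", "R12", "R13", "R16", "R19", "R20", "R21"],
}

WL1 = ["REGARDING", "OUR", "ACKNOWLEDGEMENT", "CONCERNING", "GIAD"]


def build_target_word_list(window, word_index):
    """Pair each word of the window with every record number it belongs to.

    Words that are not in the blacklist word list are kept with a record
    number of None, mirroring the "None" rows of Table 4.
    """
    target = []
    for word in window:
        records = word_index.get(word)
        if records:
            target.extend((record_id, word) for record_id in records)
        else:
            target.append((None, word))
    return target


target_wl1 = build_target_word_list(WL1, WORD_INDEX)
```

For WL1 the result has the same eight rows as Table 4: four unmatched words followed by GIAD paired with R3, R6, R12 and R17.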

In step S105, the number of words in the target word list that correspond to the same record number is calculated, and that number is recorded as the hit word count for the record number. For example, in the target word list the only word corresponding to record number "R3" is "GIAD", so the hit word count for "R3" is recorded as C1. In this step, words to be searched that did not match any word in the blacklist word list, such as "REGARDING", "OUR", "ACKNOWLEDGEMENT" and "CONCERNING", can be removed. The hit word count is then recorded in the target word list, as shown in Table 5 below:

Record number   Hit words   Hit word count
R3              GIAD        C1
R6              GIAD        C1
R12             GIAD        C1
R17             GIAD        C1

Table 5
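The counting in step S105 amounts to grouping the target word list by record number. The Python sketch below shows one way to do this for the WL1 data of Table 4; the function name and the tuple layout are illustrative assumptions.

```python
from collections import defaultdict

# Target word list for WL1 (Table 4): (record number, word); None = no match.
TARGET_WL1 = [
    (None, "REGARDING"),
    (None, "OUR"),
    (None, "ACKNOWLEDGEMENT"),
    (None, "CONCERNING"),
    ("R3", "GIAD"),
    ("R6", "GIAD"),
    ("R12", "GIAD"),
    ("R17", "GIAD"),
]


def count_hits(target_word_list):
    """Return {record_id: (hit_words, hit_word_count)}, dropping unmatched words."""
    hit_words = defaultdict(list)
    for record_id, word in target_word_list:
        if record_id is None:
            continue  # word not found in the blacklist word list
        hit_words[record_id].append(word)
    return {rid: (words, len(words)) for rid, words in hit_words.items()}


hits_wl1 = count_hits(TARGET_WL1)
```

Each of the four records ends up with the single hit word GIAD and a hit word count of 1, matching Table 5.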

Next, in steps S106 to S107, it is determined whether the group of words to be searched contains hit name data; here the hit range is set to "1" as an example. After the hit word count for each record number has been determined in the previous step (S105), the word count data corresponding to each record number is looked up in Table 1. It is then determined whether the difference between the "hit word count" and the "word count data" for each record number is less than the hit range. If so, the name data corresponding to that record number is hit name data, meaning that the group of words to be searched contains that name data from the blacklist database. The comparison results are shown in Table 6 below:

Record number   Hit word count   Word count data   Difference   Hit
R3              C1               C4                3            No
R6              C1               C4                3            No
R12             C1               C5                4            No
R17             C1               C3                2            No

Table 6
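The decision in steps S106 to S107 can be sketched in Python as follows, using the WL1 values of Table 6. The comparison is written as a strict "less than", following the wording of the text and the fact that the WL4 results later in the description treat a difference of 1 as not hit when the hit range is 1. The function names are hypothetical.

```python
HIT_RANGE = 1  # example value used in the text

# Values from Table 6 (window WL1).
HIT_COUNTS = {"R3": 1, "R6": 1, "R12": 1, "R17": 1}
WORD_COUNTS = {"R3": 4, "R6": 4, "R12": 5, "R17": 3}


def is_hit(hit_count, word_count, hit_range=HIT_RANGE):
    """A name is hit when fewer than `hit_range` of its words are missing."""
    return word_count - hit_count < hit_range


def compare(hit_counts, word_counts, hit_range=HIT_RANGE):
    """Return {record_id: (difference, hit?)} for every record with hits."""
    results = {}
    for record_id, hits in hit_counts.items():
        difference = word_counts[record_id] - hits
        results[record_id] = (difference, difference < hit_range)
    return results


results_wl1 = compare(HIT_COUNTS, WORD_COUNTS)
```

With these inputs every record has a difference of at least 2, so no record is reported as hit.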

As the "Hit" column of Table 6 shows, the method of the present invention determines that the first group of words to be searched (WL1) in the file to be searched does not contain any name data from the blacklist database 30; the first group (WL1) therefore contains no hit name data.

After the comparison of the first group of words to be searched (WL1) is complete, the processing module 10 compares the second group, the third group, and so on (WL2, WL3, ...) in the list to be searched in sequence, until every group of words to be searched has been compared.

The comparison is further illustrated below using the fourth group of words to be searched (WL4), whose content is "CONCERNING,GIAD,HEAVY,INDUSTRIES,COMPLEX". In step S104, the target word list established from the fourth group (WL4) is as follows:

Record number   Word
None            CONCERNING
R3              GIAD
R6              GIAD
R12             GIAD
R17             GIAD
R1              HEAVY
R2              HEAVY
R3              HEAVY
R12             HEAVY
R13             HEAVY
R16             HEAVY
R19             HEAVY
R20             HEAVY
R21             HEAVY
R1              INDUSTRIES
R2              INDUSTRIES
R3              INDUSTRIES
R4              INDUSTRIES
R5              INDUSTRIES
R8              INDUSTRIES
R14             INDUSTRIES
R17             INDUSTRIES
R22             INDUSTRIES
R23             INDUSTRIES
R25             INDUSTRIES
R26             INDUSTRIES
R27             INDUSTRIES
R28             INDUSTRIES
R29             INDUSTRIES
R30             INDUSTRIES
R3              COMPLEX
R6              COMPLEX
R7              COMPLEX
R9              COMPLEX
R10             COMPLEX
R11             COMPLEX
R13             COMPLEX
R15             COMPLEX
R16             COMPLEX
R18             COMPLEX
R22             COMPLEX
R24             COMPLEX

According to step S105, the number of words in the target word list that correspond to the same record number is calculated. For example, the words corresponding to record number "R2" are "HEAVY" and "INDUSTRIES", that is, two words, so its hit word count is recorded as C2. The words corresponding to record number "R3" are "GIAD", "HEAVY", "INDUSTRIES" and "COMPLEX", that is, four words, so its hit word count is recorded as C4. The target word list with the hit word counts recorded is as follows:

Record number   Hit words                       Hit word count
R1              HEAVY,INDUSTRIES                C2
R2              HEAVY,INDUSTRIES                C2
R3              GIAD,HEAVY,INDUSTRIES,COMPLEX   C4
R4              INDUSTRIES                      C1
R5              INDUSTRIES                      C1
R6              GIAD,COMPLEX                    C2
R7              COMPLEX                         C1
R8              INDUSTRIES                      C1
R9              COMPLEX                         C1
R10             COMPLEX                         C1
R11             COMPLEX                         C1
R12             GIAD,HEAVY                      C2
R13             HEAVY,COMPLEX                   C2
R14             INDUSTRIES                      C1
R15             COMPLEX                         C1
R16             HEAVY,COMPLEX                   C2
R17             GIAD,INDUSTRIES                 C2
R18             COMPLEX                         C1
R19             HEAVY                           C1
R20             HEAVY                           C1
R21             HEAVY                           C1
R22             INDUSTRIES,COMPLEX              C2
R23             INDUSTRIES                      C1
R24             COMPLEX                         C1
R25             INDUSTRIES                      C1
R26             INDUSTRIES                      C1
R27             INDUSTRIES                      C1
R28             INDUSTRIES                      C1
R29             INDUSTRIES                      C1
R30             INDUSTRIES                      C1

In steps S106 to S107, it is determined whether the group of words to be searched contains hit name data. As before, the word count data for each record number in the target word list is looked up, and it is determined whether the difference between the "hit word count" and the "word count data" for each record number is less than the hit range of "1". If so, the name data corresponding to that record number is determined to be hit name data. The complete comparison results are listed in the table below:

Record number   Hit word count   Word count data   Difference   Hit
R1              C2               C6                4            No
R2              C2               C6                4            No
R3              C4               C4                0            Yes
R4              C1               C3                2            No
R5              C1               C4                3            No
R6              C2               C4                2            No
R7              C1               C3                2            No
R8              C1               C3                2            No
R9              C1               C3                2            No
R10             C1               C6                5            No
R11             C1               C4                3            No
R12             C2               C5                3            No
R13             C2               C5                3            No
R14             C1               C3                2            No
R15             C1               C3                2            No
R16             C2               C3                1            No
R17             C2               C3                1            No
R18             C1               C12               11           No
R19             C1               C3                2            No
R20             C1               C4                3            No
R21             C1               C9                8            No
R22             C2               C7                5            No
R23             C1               C8                7            No
R24             C1               C9                8            No
R25             C1               C8                7            No
R26             C1               C2                1            No
R27             C1               C4                3            No
R28             C1               C7                6            No
R29             C1               C3                2            No
R30             C1               C4                3            No

The content of the fourth group of words to be searched (WL4) is "CONCERNING,GIAD,HEAVY,INDUSTRIES,COMPLEX". In other words, the five consecutive words to be searched starting from the fourth word of the file to be searched include four words identical to those of the name data "GIAD HEAVY INDUSTRIES COMPLEX" with record number R3. The difference between its hit word count C4 and the word count data C4 of R3 is 0, which is less than the hit range of 1, so the name data corresponding to R3 is determined to be hit name data.
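The WL4 walk-through can be reproduced with a short Python sketch that combines steps S104 to S107. The lookup contains only the four Table 2 words that occur in WL4, and the word counts are the word count data given for R1 to R30 in the WL4 comparison results; the dictionary representation and the function names are illustrative assumptions.

```python
from collections import Counter


def ids(*numbers):
    """Helper: turn 3, 6, ... into 'R3', 'R6', ..."""
    return [f"R{n}" for n in numbers]


# Rows of Table 2 for the four blacklist words that occur in WL4.
WORD_INDEX = {
    "GIAD": ids(3, 6, 12, 17),
    "HEAVY": ids(1, 2, 3, 12, 13, 16, 19, 20, 21),
    "INDUSTRIES": ids(1, 2, 3, 4, 5, 8, 14, 17, 22, 23, 25, 26, 27, 28, 29, 30),
    "COMPLEX": ids(3, 6, 7, 9, 10, 11, 13, 15, 16, 18, 22, 24),
}

# Word count data for R1..R30, in record order.
WORD_COUNTS = dict(zip(
    ids(*range(1, 31)),
    [6, 6, 4, 3, 4, 4, 3, 3, 3, 6, 4, 5, 5, 3, 3,
     3, 3, 12, 3, 4, 9, 7, 8, 9, 8, 2, 4, 7, 3, 4],
))

WL4 = ["CONCERNING", "GIAD", "HEAVY", "INDUSTRIES", "COMPLEX"]


def count_hits(window, word_index):
    """Steps S104-S105: hit word count per record number for one window."""
    hits = Counter()
    for word in window:
        for record_id in word_index.get(word, []):
            hits[record_id] += 1
    return hits


def hit_records(window, word_index, word_counts, hit_range=1):
    """Steps S106-S107: records whose missing-word count is below hit_range."""
    hits = count_hits(window, word_index)
    return sorted(r for r, n in hits.items() if word_counts[r] - n < hit_range)


hits_wl4 = count_hits(WL4, WORD_INDEX)
hit_wl4 = hit_records(WL4, WORD_INDEX, WORD_COUNTS)
```

Under these inputs all 30 records receive at least one hit word, but only R3 reaches a difference of 0 and is reported as hit.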

In summary, the method and system of the present invention for quickly searching for a preset list in a file can, starting from the first word of a file to be searched, capture a fixed number of consecutive words to be searched at a time and scan the entire file in sequence. Each word to be searched is compared against the pre-built blacklist word list; once the comparison is complete, the hit word count for each record number is calculated and compared with the word count data of the corresponding name data to determine whether that name data is hit. Because the blacklist word list, arranged by the word values in the name data, is built in advance, determining whether a name data is hit only requires checking whether words are identical and comparing the hit word count with the word count data, so the computational burden is low and execution is fast. Moreover, because the words to be searched are compared with the blacklist word list word by word, all words to be searched that differ from a name data within a given range can be found, regardless of whether the words of the hit name appear in the same order as in the name data in the original blacklist database. This solves the problem that multiple consecutive unspecified words in an entire file cannot be fuzzily matched against the name data in the blacklist database.
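The order independence described above can be illustrated with a compact Python sketch that scans a whole word sequence window by window. Everything here is an assumption made for illustration: the two-record database (R3 as given in the text, R26 with its word set read off Table 2), the sample sentences, and the function names.

```python
from collections import Counter, defaultdict

# Hypothetical two-record database.
NAMES = {
    "R3": "GIAD HEAVY INDUSTRIES COMPLEX",
    "R26": "FARASAKHT INDUSTRIES",
}


def build_index(names):
    """Word -> list of record numbers containing that word."""
    index = defaultdict(list)
    for record_id, name in names.items():
        for word in name.split():
            index[word].append(record_id)
    return index


def scan(words, names, window_size=5, hit_range=1):
    """Return the sorted record numbers hit by any window of the word sequence."""
    index = build_index(names)
    word_counts = {rid: len(name.split()) for rid, name in names.items()}
    found = set()
    last_start = max(len(words) - window_size, 0)
    for start in range(last_start + 1):
        hits = Counter()
        for word in words[start:start + window_size]:
            for record_id in index.get(word, ()):
                hits[record_id] += 1
        found.update(r for r, n in hits.items() if word_counts[r] - n < hit_range)
    return sorted(found)


# The words of R3 appear in a different order than in the database entry.
shuffled = scan("PAYMENT TO COMPLEX INDUSTRIES HEAVY GIAD TODAY".split(), NAMES)
partial = scan("HEAVY INDUSTRIES TODAY".split(), NAMES)
```

Because only per-record counts are compared, the shuffled sentence still hits R3, while the partial sentence hits nothing with a hit range of 1.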

The above describes only embodiments of the present invention and does not limit the present invention in any form. Although the present invention has been disclosed above by way of embodiments, these are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make minor changes or modifications that result in equivalent embodiments. Any simple modification, equivalent change or refinement made to the above embodiments in accordance with the technical essence of the present invention, without departing from the content of its technical solution, still falls within the scope of the technical solution of the present invention.

10: Processing module
20: Storage module
30: Blacklist database

FIG. 1 is a flow chart of the method of the present invention for quickly searching for a preset list in a file.
FIG. 2 is a block diagram of the system of the present invention for quickly searching for a preset list in a file.
FIG. 3 is a partial flow chart of an embodiment of the method of the present invention for quickly searching for a preset list in a file.
FIG. 4 is a partial flow chart of an embodiment of the method of the present invention for quickly searching for a preset list in a file.

Claims

A method for quickly searching for a preset list in a file, comprising the following steps:
reading a blacklist database, the blacklist database containing a plurality of record numbers, and a plurality of name data and a plurality of word count data sorted according to the record numbers, wherein each name data contains a plurality of words and each word count data records the number of words in the corresponding name data;
creating a blacklist word list based on the blacklist database, the blacklist word list including each word in the name data and the record number corresponding to the name data to which the word belongs;
receiving a file to be searched, and reading a group of consecutive plural words to be searched in the file to be searched;
comparing the group of consecutive plural words to be searched with the blacklist word list, and establishing at least one word that is identical to any of the words to be searched, together with at least one record number corresponding to the at least one word, as a target word list;
calculating the number of words corresponding to the same record number in the target word list, and recording that number as at least one hit word count corresponding to the at least one record number;
comparing the at least one hit word count with the word count data corresponding to the at least one record number, and determining whether the difference between the at least one hit word count and the at least one word count data is within a hit range;
if the difference between one of the hit word counts and the corresponding word count data is within the hit range, the name data is a hit name data;
if not, the name data corresponding to the record number is not a hit name data;
completing the comparison of the group of words to be searched.

The method for quickly searching for a preset list in a file as described in claim 1, wherein the words in the blacklist word list are sorted according to the value of each word, so that identical words are arranged together.

The method for quickly searching for a preset list in a file as described in claim 1, wherein when the file to be searched is received, a list to be searched is first created, the list to be searched including a plurality of groups of consecutive plural words to be searched in the file to be searched; and when the comparison of one group of words to be searched is complete, the comparison moves to the next group of words to be searched in the list to be searched and is performed again, until every group of words to be searched in the list to be searched has been compared.

The method for quickly searching for a preset list in a file as described in claim 3, wherein the list to be searched is created according to the following sub-steps:
starting from the first word to be searched in the file to be searched, reading a number of consecutive words to be searched equal to a single comparison word count, and storing them in the list to be searched;
starting from the second word to be searched among the plural words to be searched read in the previous step, reading a number of consecutive words to be searched equal to the single comparison word count, and storing them in the list to be searched;
repeating the previous step until the last word of the plural words to be searched is the last word to be searched in the file to be searched.

The method for quickly searching for a preset list in a file as described in claim 4, wherein when the consecutive words to be searched of the single comparison word count are read, it is further determined whether the plural words to be searched include two consecutive and identical words to be searched; if so, one of the consecutive and identical words to be searched is ignored, the next word to be searched following the consecutive plural words to be searched in the file to be searched is read, and that next word to be searched is added to the group of words to be searched in the list to be searched.

The method for quickly searching for a preset list in a file as described in claim 1, wherein the step of creating the blacklist word list includes the following sub-steps:
reading each word in the name data sequentially according to the record numbers, and temporarily storing the words as a full word list;
rearranging the words in the full word list according to the value of each word, and storing the result as the blacklist word list.

The method for quickly searching for a preset list in a file as described in claim 6, wherein when the blacklist word list is created, each name data is first compared against a common word list containing a plurality of common words, the common words contained in each name data are removed, and the blacklist database is then created based on the name data that does not contain common words.

A system for quickly searching for a preset list in a file, connected to a blacklist database, the blacklist database containing a plurality of name data sorted according to a plurality of record numbers and a plurality of word count data corresponding to the name data, wherein each name data contains a plurality of words and each word count data records the number of words contained in the corresponding name data; the system for quickly searching for a preset list in a file comprising:
a processing module connected to the blacklist database;
a storage module connected to the processing module;
wherein the processing module creates a blacklist word list based on the blacklist database and stores the blacklist word list in the storage module, the blacklist word list including each word in the name data and the record number corresponding to the name data to which each word belongs;
the processing module receives a file to be searched, stores the file to be searched in the storage module, and reads a group of consecutive plural words to be searched in the file to be searched;
the processing module compares the group of consecutive plural words to be searched with the blacklist word list, establishes at least one word that is identical to any of the words to be searched, together with at least one record number corresponding to the at least one word, as a target word list, and stores the target word list in the storage module;
the processing module calculates the number of words corresponding to the same record number in the target word list, and records that number as at least one hit word count corresponding to the at least one record number;
the processing module compares the at least one hit word count with the word count data corresponding to the at least one record number, and determines whether the difference between the at least one hit word count and the at least one word count data is within a hit range;
if the difference between one of the hit word counts and the corresponding word count data is within the hit range, the processing module determines that the name data is a hit name data;
if not, the processing module determines that the name data corresponding to the record number is not a hit name data;
the comparison of the group of words to be searched is completed.

The system for quickly searching for a preset list in a file as described in claim 8, wherein the words in the blacklist word list are sorted according to the value of each word, so that identical words are arranged together.

The system for quickly searching for a preset list in a file as described in claim 8, wherein when the processing module receives the file to be searched, it first creates a list to be searched, the list to be searched including a plurality of groups of consecutive plural words to be searched in the file to be searched; and when the comparison of one group of words to be searched is complete, the comparison moves to the next group of words to be searched in the list to be searched and is performed again, until every group of words to be searched in the list to be searched has been compared.

The system for quickly searching for a preset list in a file as described in claim 10, wherein when the processing module creates the list to be searched, it first starts from the first word to be searched in the file to be searched, reads a number of consecutive words to be searched equal to a single comparison word count, and stores them in the list to be searched;
starting from the second word to be searched among the plural words to be searched read in the previous step, it reads a number of consecutive words to be searched equal to the single comparison word count and stores them in the list to be searched;
the previous step is repeated until the last word of the plural words to be searched is the last word to be searched in the file to be searched.

The system for quickly searching for a preset list in a file as described in claim 11, wherein when reading the consecutive words to be searched of the single comparison word count, the processing module further determines whether the plural words to be searched include two consecutive and identical words to be searched; if so, the processing module ignores one of the consecutive and identical words to be searched, reads the next word to be searched following the consecutive plural words to be searched in the file to be searched, and adds that next word to be searched to the group of words to be searched in the list to be searched.

The system for quickly searching for a preset list in a file as described in claim 8, wherein when the processing module creates the blacklist word list, the processing module first reads each word in the name data sequentially according to the record numbers and temporarily stores the words in the storage module as a full word list; the words in the full word list are then rearranged according to the value of each word and stored in the storage module as the blacklist word list.

The system for quickly searching for a preset list in a file as described in claim 8, wherein when creating the blacklist word list, the processing module first compares each name data against a common word list containing a plurality of common words, removes the common words contained in each name data, and then creates the blacklist database based on the name data that does not contain common words.