TWI816141B - System and method for quickly searching for default lists in documents - Google Patents
- Publication number
- TWI816141B (application TW110122165A)
- Authority
- TW
- Taiwan
- Prior art keywords
- word
- words
- list
- searched
- found
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The system of the present invention for quickly searching for a preset list in a document is used to find blacklist database name data that may be contained in a document to be searched, which consists of many consecutive words. The processing module first builds the individual words of all name data in the blacklist database into a blacklist word list. It then reads a plurality of consecutive words to be searched from the document, finds every word in the blacklist word list that matches any of those words, and builds the matches into a target word list. It then counts the words in the target word list that belong to the same name data as the hit word count and compares that count with the number of words in the name data. If the difference falls within a hit range, the name data is judged to be hit, which achieves the purpose of finding unspecified name data in the document to be searched.
Description
A search system and method, and in particular a system and method for quickly searching for a preset list in a document.
In today's globalized world, financial institutions around the world deal with one another to some degree. To facilitate communication, messaging systems are an important means of transmitting information between financial institutions. Apart from its standardized packet format, a message places no particular restriction on its main text, whose style of narration and length vary. A key concern is that the main text may contain the names of persons, companies, or organizations that have bad credit records or have appeared in negative news reports.
Financial institutions generally screen against a global blacklist database and set up a retrieval system for looking up information on persons in that database when needed. The blacklist database is usually a relational database. After the user enters a target name string into the retrieval system, the system performs an exact comparison between the target name string and the list of names in the blacklist database, looks for name field contents that completely match the target name string, and, once a match is found, reads and outputs the related information of that field for the user to browse.
However, such a retrieval system requires the user to enter a name to be searched and performs an exact comparison, so it can only retrieve content that completely matches the entered name. Because the retrieval system compares the input text against the information in the blacklist database in sequence, entering a whole passage is pointless when a possibly blacklisted name is buried in a passage of unspecified text: the system cannot detect or extract the part of the passage that may be the target name. When a financial institution receives a message, the main text consists of continuous sentences with no specific format, so it is difficult to determine directly where the name of a person, company, or organization appears. A general retrieval system therefore cannot be used to search for blacklisted names.
Moreover, such retrieval systems not only cannot automatically extract names hidden in a passage of text; the name of the same person, company, or organization may also take several forms, for example with the surname and given name swapped or with a title inserted. The name "王曉明" (Wang Xiaoming), for instance, may appear in a message as "曉明,王" (Xiaoming, Wang) or "王先生曉明" (Mr. Wang Xiaoming). This further increases the difficulty of finding, in an arbitrary message, the several words that may form a name and then searching the blacklist database for them. Existing retrieval systems therefore leave room for improvement.
In view of the fact that existing blacklist database retrieval systems cannot efficiently search an entire document for target names, the present invention provides a method and system for quickly searching for a preset list in a document. The method includes the following steps:
Reading a blacklist database, which contains a plurality of record numbers, a plurality of name data sorted by record number, and a plurality of word count data, where each name data contains a plurality of words and each word count data records the number of words in the corresponding name data;
Building a blacklist word list from the blacklist database, where the blacklist word list contains every word in each name data and the record number of the name data to which each word belongs;
Receiving a document to be searched and reading a group of consecutive words to be searched from it;
Comparing the group of consecutive words to be searched against the blacklist word list, and building at least one word that is identical to any word to be searched, together with the at least one record number corresponding to it, into a target word list;
Counting the words in the target word list that correspond to the same record number, and recording that count as at least one hit word count for the at least one record number;
Comparing the at least one hit word count with the word count data of the corresponding record number, and determining whether their difference is within a hit range;
If the difference between a hit word count and the corresponding word count data is within the hit range, the name data is hit name data;
If not, the name data corresponding to that record number is not hit name data;
Completing the comparison of the group of words to be searched.
In addition, the present invention provides a system for quickly searching for a preset list in a document, including:
A processing module connected to the blacklist database;
A storage module connected to the processing module; the processing module builds a blacklist word list from the blacklist database and stores it in the storage module; the blacklist word list contains every word in each name data and the record number of the name data to which each word belongs;
The processing module receives a document to be searched, stores it in the storage module, and reads a group of consecutive words to be searched from it;
The processing module compares the group of consecutive words to be searched against the blacklist word list, builds at least one word that is identical to any word to be searched, together with the at least one record number corresponding to it, into a target word list, and stores the target word list in the storage module;
The processing module counts the words in the target word list that correspond to the same record number and records that count as at least one hit word count for the at least one record number;
The processing module compares the at least one hit word count with the word count data of the corresponding record number and determines whether their difference is within a hit range;
If the difference between a hit word count and the corresponding word count data is within the hit range, the processing module judges the name data to be hit name data;
If not, the processing module judges that the name data corresponding to that record number is not hit name data;
The comparison of the group of words to be searched is complete.
The method of the present invention first builds all the words of all name data in the blacklist database into a separate blacklist word list and records, for each word, the number of words contained in the name data to which it belongs. When a document to be searched is received, a group of consecutive words to be searched is read from it, and the blacklist word list is searched for each of those words. Every identical word, including words that are repeated in the blacklist word list under different record numbers, is built into a target word list. The number of words in the target word list that correspond to the same record number, that is, that belong to the same name data, is then counted, and these hit word counts are recorded in the target word list against each record number. Finally, the hit word count of each record number in the target word list is compared with the word count data in the blacklist database. If the difference between the two is within a hit range, several of the words to be searched are identical to words of one name data, so the words to be searched very probably contain that name data, and the name data is judged to be hit name data.
For example, suppose a name in the blacklist database is "王曉明" (Wang Xiaoming), which has 3 characters, so its word count data is "3", and the words to be searched that are read from the document are "王先生曉明" (Mr. Wang Xiaoming). The search described above finds a hit word count of 3 among the words to be searched. The difference between the hit word count and the word count data is 0, so the name data "王曉明" is judged to be hit name data.
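The arithmetic of this example can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `hit_count` is hypothetical, and the threshold `HIT_RANGE = 1` with a strict "less than" test is borrowed from the later embodiment rather than stated for this example.

```python
def hit_count(name: str, window: str) -> int:
    """Count how many characters of `name` also occur in `window`."""
    window_chars = set(window)
    return sum(1 for ch in name if ch in window_chars)


name = "王曉明"        # blacklist name data; its word count data is 3
window = "王先生曉明"  # consecutive characters read from the document

HIT_RANGE = 1  # assumed threshold, taken from the later embodiment

hits = hit_count(name, window)
difference = len(name) - hits
is_hit = difference < HIT_RANGE
```

Because only membership is tested, the inserted title "先生" and any reordering of the name's characters do not affect the count.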
The method and system of the present invention break all name data in the blacklist database down into a blacklist word list, compare each of the words to be searched against it separately, and use the hit word count to judge whether the words to be searched may contain a given name data. Because this search does not constrain the order of the words to be searched or of the words of the name data in the blacklist database, the name can be found whether or not its words appear in the same order as in the name data and whether or not extra words are inserted among them. This solves the problem that conventional blacklist database retrieval systems cannot find name data from the preset list within an entire document.
The technical means adopted by the present invention to achieve its intended purpose are further described below with reference to the drawings and embodiments of the present invention.
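Referring to FIG. 1 and FIG. 2, the system of the present invention for quickly searching for a preset list in a document includes a processing module 10 and a storage module 20. The processing module 10 is connected to a blacklist database 30, receives a document to be searched, and executes the method of the present invention. The storage module 20 is connected to the processing module 10 and stores or temporarily holds the document to be searched, the blacklist word list, and the target word list. The processing module 10 is, for example, the main processing element of an electronic computing device such as a server or a personal computer. The storage module 20 is a storage device such as a conventional hard disk drive (HDD) or a solid-state drive (SSD), and is preferably local to the processing module 10. The blacklist database 30 is, for example, hosted on a cloud server so that the administering body can update it at any time; the processing module 10 connects to and reads the blacklist database 30 over the Internet, builds the blacklist word list from it, and stores the list in the local storage module 20.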
The document to be searched is, for example, a message that a bank or a financial supervisory body receives over the Internet from another related institution, containing body text of indefinite length and no specific format. When the processing module 10 receives the document to be searched, it temporarily stores it in the storage module 20.
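The method of the present invention for quickly searching for a preset list in a document includes the following steps:
The processing module 10 reads a blacklist database 30, which contains a plurality of record numbers, a plurality of name data sorted by record number, and a plurality of word count data (S101); each name data contains a plurality of words, and each word count data records the number of words in the corresponding name data;
The processing module 10 builds a blacklist word list from the blacklist database 30 (S102); the blacklist word list contains every word in each name data and the record number of the name data to which the word belongs;
The processing module 10 receives a document to be searched and reads a group of consecutive words to be searched from it (S103); in other words, the document contains a plurality of consecutive original words, and the group of consecutive words to be searched is a word set consisting of some of those consecutive original words. In this embodiment, the processing module 10 reads the group from the document according to a single-comparison word count; that is, the number of words in the group equals the single-comparison word count, which is a preset value;
The processing module 10 compares each word to be searched against the blacklist word list and builds at least one word that is identical to any word to be searched, together with the at least one record number corresponding to it, into a target word list (S104); the target word list is stored in the storage module 20;
The processing module 10 counts the words in the target word list that correspond to the same record number and records that count as at least one hit word count for the at least one record number (S105);
The processing module 10 compares the at least one hit word count with the word count data of the corresponding record number and determines whether their difference is within a hit range (S106); the hit range is a user-preset value, and a difference "within the hit range" means that the difference is smaller than the value of the hit range;
If the difference between a hit word count and the corresponding word count data is within the hit range, the processing module 10 judges the name data to be hit name data (S107);
If not, the processing module 10 judges that the name data corresponding to that record number is not hit name data;
The comparison of the group of words to be searched is complete (S108).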
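Steps S102 and S104 to S107 can be sketched as follows. This is a rough illustration under stated assumptions, not the patent's implementation: the function names `build_word_list` and `match_window`, the dictionary-based data layout, and the blacklist entries "ALPHA BETA GAMMA" and "BETA DELTA" are all hypothetical. The hit range defaults to 1 and uses the strict "less than" test described in S106.

```python
from collections import Counter, defaultdict


def build_word_list(blacklist):
    """S102: map every word to the record numbers whose name data contains it."""
    word_list = defaultdict(set)
    for record_no, name in blacklist.items():
        for word in name.split():
            word_list[word].add(record_no)
    return word_list


def match_window(window, blacklist, word_list, hit_range=1):
    """S104 to S107: return the record numbers judged to be hits for one group."""
    # S104: target word list of (word, record number) pairs found in the group.
    target = {(word, record_no)
              for word in window
              for record_no in word_list.get(word, ())}
    # S105: hit word count per record number.
    hit_counts = Counter(record_no for _, record_no in target)
    # S106 and S107: compare each hit word count with the word count data.
    hits = []
    for record_no, hit_count in hit_counts.items():
        word_count = len(blacklist[record_no].split())
        if word_count - hit_count < hit_range:
            hits.append(record_no)
    return sorted(hits)


# Hypothetical blacklist: record number -> name data.
blacklist = {"R1": "ALPHA BETA GAMMA", "R2": "BETA DELTA"}
word_list = build_word_list(blacklist)

# One group of five consecutive words, with the name reordered and a title inserted.
window = ["PAY", "GAMMA", "MR", "ALPHA", "BETA"]
result = match_window(window, blacklist, word_list)
```

Here R1 reaches a hit word count of 3 against a word count of 3 (difference 0), while R2 reaches 1 against 2 (difference 1, which is not smaller than the hit range), so only R1 is reported.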
Referring to FIG. 3, in one embodiment of the present invention the words in the blacklist word list are sorted by word value so that identical words are adjacent. The step of building the blacklist word list (S102) can be carried out with the following sub-steps:
The processing module 10 follows the record numbers in the blacklist database 30, reads every word of each name data in order, and temporarily stores them as a full word list (S1021); the words in the full word list are ordered by record number and by their position within the name data;
The processing module 10 re-sorts the words of the full word list by word value and stores the result as the blacklist word list (S1022).
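Sub-steps S1021 and S1022 amount to flattening the name data into (word, record number) pairs and then sorting them. A minimal sketch, assuming a dictionary that maps record numbers to name data; the function names and the sample names are hypothetical.

```python
def build_full_word_list(blacklist):
    """S1021: read words record by record, in their order within each name."""
    return [(word, record_no)
            for record_no, name in blacklist.items()
            for word in name.split()]


def build_blacklist_word_list(blacklist):
    """S1022: re-sort by word value so that identical words sit together."""
    return sorted(build_full_word_list(blacklist))


# Hypothetical blacklist: record number -> name data.
blacklist = {"R1": "ALPHA BETA GAMMA", "R2": "BETA DELTA"}

full_list = build_full_word_list(blacklist)
sorted_list = build_blacklist_word_list(blacklist)
```

After sorting, the two entries for "BETA" (records R1 and R2) are adjacent, which is the property the embodiment relies on.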
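Because the blacklist word list is already stored in the storage module 20, the processing module 10 can read it directly from the local storage module 20 when comparing the words to be searched with the words in the list, without accessing the remote blacklist database 30 over the Internet again, which speeds up the comparison. In addition, the processing module 10 compares each word to be searched one-to-one with the words in the blacklist word list, and the list is sorted by word value, so once the processing module 10 finds a matching word it can find the other matching words at nearby storage addresses in the storage module 20, which further improves efficiency. Compared with a complex cross-comparison of an input name string against the name data strings in the blacklist database, the present invention therefore has a clear speed advantage in finding possible target names in an entire document.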
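One way to exploit a word list sorted by word value is a binary search that returns the whole run of identical words at once. The text does not say which lookup algorithm is used, so the use of `bisect` below is an assumption, and the sample data and the name `lookup` are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical blacklist word list, already sorted by word value.
WORDS = [("ALPHA", "R1"), ("BETA", "R1"), ("BETA", "R2"),
         ("DELTA", "R2"), ("GAMMA", "R1")]
KEYS = [word for word, _ in WORDS]  # computed once, reused for every lookup


def lookup(word):
    """Return the record numbers of every entry equal to `word`."""
    lo = bisect_left(KEYS, word)   # first position of the run
    hi = bisect_right(KEYS, word)  # one past the last position of the run
    return [record_no for _, record_no in WORDS[lo:hi]]


beta_records = lookup("BETA")
missing_records = lookup("OMEGA")
```

Because identical words are contiguous, a single search locates all record numbers for a word, and a word that is absent yields an empty slice.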
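In one embodiment of the present invention, when the processing module 10 receives the document to be searched, it first builds a list to be searched and stores it in the storage module 20. The list contains a plurality of consecutive groups of words to be searched from the document, and together these groups cover every word to be searched in the document. After building the list, the processing module 10 performs the comparison of steps S104 to S107 on the first group, then moves to the next group in the list, and continues until every group in the list has been compared.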
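Referring to FIG. 4, the list to be searched is preferably built according to the following sub-steps:
The processing module 10 starts from the first word to be searched in the document, reads a single-comparison word count of consecutive words to be searched, and stores them in the list to be searched (S401);
The processing module 10 starts from the second of the words read in the previous step, reads the single-comparison word count of consecutive words to be searched, and stores them in the list to be searched (S402);
The processing module 10 repeats the previous step until the last of the words read is the final word to be searched in the document (S403);
The list to be searched is complete.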
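Sub-steps S401 to S403 describe a sliding window that advances one word at a time. A minimal sketch follows; the function name `build_search_list` is hypothetical, and returning a single short group when the document has fewer words than the window size is an assumption, since the text does not cover that case.

```python
def build_search_list(words, size):
    """Return every group of `size` consecutive words, shifting by one word."""
    if not words:
        return []
    if len(words) <= size:
        return [list(words)]  # assumption: a short document forms one group
    return [words[i:i + size] for i in range(len(words) - size + 1)]


document = "A B C D E F G".split()
groups = build_search_list(document, 5)
```

A document of n words with a window of k words yields n - k + 1 groups, here 7 - 5 + 1 = 3.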
In other words, the list to be searched is built by starting from the first word, reading and storing a single-comparison word count of words to be searched, shifting by one word, reading and storing the same number of words starting from the second word in the document, and shifting, reading, and storing again until the final word of the document has been read and stored. The list to be searched therefore contains every run of consecutive words to be searched in the document, from the first word to the last. The single-comparison word count determines how many words to be searched the processing module 10 compares in one hit comparison. It can be set from the average or the maximum word count of the name data in the blacklist database 30, or by rule of thumb. For example, if the single-comparison word count is 5, each group contains 5 consecutive words to be searched.
When reading the single-comparison word count of consecutive words to be searched, the processing module 10 also determines whether they include two consecutive identical words to be searched. If so, it ignores one of the two and adds the next word to be searched to the group instead.
In other words, when the list to be searched is built and a group contains a word to be searched that is repeated consecutively, one of the repeated words is ignored and the next word to be searched is read in its place. This prevents duplicate words within one group from causing redundant comparisons.
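Reading one group while skipping consecutive repeats might look like the sketch below. The function name `read_group` and the sample words are hypothetical, and treating a run of three or more identical words the same way as a pair is an assumption.

```python
def read_group(words, start, size):
    """Read `size` words from `start`, skipping a word equal to its predecessor."""
    group = []
    previous = None
    for word in words[start:]:
        if word == previous:
            continue  # ignore the consecutive repeat and read on
        group.append(word)
        previous = word
        if len(group) == size:
            break
    return group


document = ["PAY", "TO", "TO", "ALPHA", "BETA", "GAMMA"]
first_group = read_group(document, 0, 5)
```

The second "TO" is dropped and "GAMMA" is pulled in, so the group still holds five distinct consecutive words.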
In another embodiment of the present invention, before the blacklist word list is built, each name data is compared against a common vocabulary list containing a plurality of common terms, and any common terms it contains are removed. The blacklist word list is then built from the name data with the common terms removed.
In this embodiment, the common terms are, for example, the Chinese terms "公司" (company), "有限公司" (limited company), and "財團法人" (foundation), and the English terms "COMPANY LIMITED", "COMPANY", "LIMITED", "IMPORT EXPORT CORP", "IMPORT EXPORT CORPORATION", and "IMPORT AND EXPORT CORPORATION". In the blacklist database 30 these common terms are noise that does not help identify a name. When a name data is found to contain them, they are removed first, which keeps the blacklist word list smaller and makes the comparison more efficient.
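Removing common terms before building the word list can be sketched as below, using the English terms listed in this embodiment. Matching on whole words and trying longer phrases first are assumptions, and the function name `strip_common` and the sample name "ACME" are hypothetical.

```python
COMMON_PHRASES = [
    "COMPANY LIMITED", "COMPANY", "LIMITED",
    "IMPORT EXPORT CORP", "IMPORT EXPORT CORPORATION",
    "IMPORT AND EXPORT CORPORATION",
]


def strip_common(name, phrases=COMMON_PHRASES):
    """Remove every common phrase from `name`, longest phrases first."""
    words = name.split()
    for phrase in sorted(phrases, key=lambda p: -len(p.split())):
        target = phrase.split()
        n = len(target)
        kept = []
        i = 0
        while i < len(words):
            if words[i:i + n] == target:
                i += n  # skip the whole matched phrase
            else:
                kept.append(words[i])
                i += 1
        words = kept
    return " ".join(words)


cleaned = strip_common("ACME IMPORT EXPORT CORP")
```

Only the distinctive part of the name remains, so fewer words enter the blacklist word list.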
The method of the present invention for quickly searching for a preset list in a document is illustrated below with an example.
In this example, the record numbers, name data, and word count data contained in the blacklist database 30 are shown in Table 1 below.
Here C1 denotes a count of 1, C2 a count of 2, and so on, with Cn denoting a count of n. Word count data of C1 therefore means that the name data contains 1 word, C2 means that it contains 2 words, and in general Cn means that it contains n words.
The blacklist word list built according to step S102 and its sub-steps is shown in Table 2 below:
In this example, the content of the document to be searched is as follows: "REGARDING OUR ACKNOWLEDGEMENT CONCERNING GIAD HEAVY INDUSTRIES COMPLEX DATED DD 20200929 WE HAVE TODAY SENT A SECOND REMINDER ON YOUR BEHALF. FOR ANY FUTURE CORRESPONDENCE RELATED TO THIS CASE PLEASE QUOTE OUR ENQUIRY REFERENCE USP200928-000830. REGARDS CLIENT SERVICES"
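The list to be searched built according to step S401 is shown in Table 3 below. The single-comparison word count is set to 5 by way of example, so each group contains 5 consecutive words to be searched. Table 3 lists 32 groups of words to be searched in total (WL1 to WL32):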
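In step S104, the processing module 10 compares the words to be searched in each group against the blacklist word list (Table 2) and builds every word to be searched that is identical to a word in the blacklist word list, together with the record number of that word in the blacklist word list, into the target word list. For example, the target word list built after comparing the first group of words to be searched (WL1) is shown in Table 4 below: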
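In step S105, the words in the target word list that correspond to the same record number are counted, and the count is recorded as the hit word count for that record number. For example, the only word in the target word list that corresponds to record number "R3" is "GIAD", so the hit word count is recorded as C1. In this step, words to be searched that matched no word in the blacklist word list, such as "REGARDING", "OUR", "ACKNOWLEDGEMENT", and "CONCERNING", can be removed. The hit word count is then recorded in the target word list, as shown in Table 5 below: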
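Next, in steps S106 to S107, it is determined whether the group of words to be searched contains hit name data, with the hit range set to "1" by way of example. After the previous step (S105) has determined the number of hit words for each record number, the word count data for each record number is looked up in Table 1. It is then determined whether the difference between the "hit word count" and the "word count data" of each record number is smaller than the hit range. If so, the name data of that record number is hit name data, and the group of words to be searched contains that name data from the blacklist database. The comparison results are shown in Table 6 below: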
As the "hit or not" column of Table 6 shows, the method of the present invention determines that the first group of words to be searched (WL1) in the document does not contain any name data from the blacklist database 30, so the first group (WL1) contains no hit name data.
After the first group of words to be searched (WL1) has been compared, the processing module 10 compares the second group, the third group, and so on (WL2, WL3, ...) in the list to be searched in turn, until every group has been compared.
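The comparison is illustrated again below with the fourth group of words to be searched (WL4), whose content is "CONCERNING, GIAD, HEAVY, INDUSTRIES, COMPLEX". In step S104, the target word list built from the fourth group (WL4) is as follows: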
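According to step S105, the words in the target word list that correspond to the same record number are counted. For example, the words corresponding to record number "R2" are "HEAVY" and "INDUSTRIES", that is, two words, so the hit word count is recorded as C2. The words corresponding to record number "R3" are "GIAD", "HEAVY", "INDUSTRIES", and "COMPLEX", that is, four words, so the hit word count is recorded as C4. The target word list with the hit word counts recorded is as follows: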
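In steps S106 to S107, it is determined whether the group of words to be searched contains hit name data. As before, the word count data for each record number in the target word list is looked up to determine whether the difference between the "hit word count" and the "word count data" of each record number is smaller than the hit range "1". If so, the name data of that record number is judged to be hit name data. The complete comparison results are listed in the following table: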
The content of the fourth group of words to be searched (WL4) is "CONCERNING, GIAD, HEAVY, INDUSTRIES, COMPLEX". That is, the 5 consecutive words to be searched starting from the 4th word of the document include 4 words that are identical to words of the name data "GIAD HEAVY INDUSTRIES COMPLEX" with record number R3. The difference between its hit word count C4 and the word count data C4 of R3 is 0, which is smaller than the hit range 1, so the name data of R3 is judged to be hit name data.
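The WL1 and WL4 results for record R3 can be reproduced with a short script. This is a sketch under stated assumptions: only R3 is included because it is the only record whose full name data appears in the text, the message is truncated to its first eleven words for brevity, and the function name `hits_for` is hypothetical.

```python
# Only R3's name data is given in full in the text; other records are omitted.
BLACKLIST = {"R3": "GIAD HEAVY INDUSTRIES COMPLEX"}
MESSAGE = ("REGARDING OUR ACKNOWLEDGEMENT CONCERNING GIAD HEAVY "
           "INDUSTRIES COMPLEX DATED DD 20200929")
WINDOW = 5     # single-comparison word count used in the example
HIT_RANGE = 1  # hit range used in the example

words = MESSAGE.split()
groups = [words[i:i + WINDOW] for i in range(len(words) - WINDOW + 1)]


def hits_for(group):
    """Return {record number: hit word count} for records judged to be hits."""
    result = {}
    for record_no, name in BLACKLIST.items():
        name_words = name.split()
        count = sum(1 for word in name_words if word in group)
        if count and len(name_words) - count < HIT_RANGE:
            result[record_no] = count
    return result


wl1, wl4 = groups[0], groups[3]
wl1_hits = hits_for(wl1)
wl4_hits = hits_for(wl4)
```

WL1 contains only "GIAD" from R3 (difference 3), so it is not a hit, while WL4 contains all four words of R3 (difference 0), matching the conclusion above.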
In summary, the method and system of the present invention for quickly searching for a preset list in a document take a fixed-length run of consecutive words to be searched at a time, starting from the first word of the document, and scan the whole document in sequence. Each word to be searched is compared with the blacklist word list built in advance. After the comparison, the hit word count for each record number is calculated and compared with the word count data of the corresponding name data to decide whether there is a hit. Because the blacklist word list, sorted by the values of the words in the name data, is built beforehand, deciding whether a name data is hit only requires checking whether individual words are identical and comparing the hit word count with the word count data, so the computational load is low and execution is fast. And because the words to be searched are compared with the blacklist word list one word at a time, all words to be searched that differ from a name data by no more than a set amount can be found, regardless of whether the words of the hit name appear in the same order as in the name data of the original blacklist database. This solves the problem that runs of unspecified consecutive words in an entire document cannot be fuzzily matched against the name data in a blacklist database.
The above are merely embodiments of the present invention and do not limit it in any form. Although the present invention has been disclosed above by way of embodiments, they are not intended to limit it. Anyone skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make minor changes or modifications that amount to equivalent embodiments. Any simple modification, equivalent change, or refinement made to the above embodiments in accordance with the technical substance of the present invention, without departing from the content of its technical solution, still falls within the scope of the technical solution of the present invention.
10: processing module; 20: storage module; 30: blacklist database
FIG. 1 is a decision flowchart of the method of the present invention for quickly searching for a preset list in a document. FIG. 2 is a block diagram of the system of the present invention for quickly searching for a preset list in a document. FIG. 3 is a partial decision flowchart of one embodiment of the method. FIG. 4 is a partial decision flowchart of one embodiment of the method.
Claims (14)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW110122165A TWI816141B (en) | 2021-06-17 | 2021-06-17 | System and method for quickly searching for default lists in documents |
Publications (2)
Publication Number | Publication Date |
---|---|
TW202301145A TW202301145A (en) | 2023-01-01 |
TWI816141B true TWI816141B (en) | 2023-09-21 |
Family
ID=86658109
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW110122165A TWI816141B (en) | 2021-06-17 | 2021-06-17 | System and method for quickly searching for default lists in documents |
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI816141B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWM592561U (en) * | 2019-08-30 | 2020-03-21 | 台中商業銀行股份有限公司 | Blacklist database searching system |
US20200250139A1 (en) * | 2018-12-31 | 2020-08-06 | Dathena Science Pte Ltd | Methods, personal data analysis system for sensitive personal information detection, linking and purposes of personal data usage prediction |
-
2021
- 2021-06-17 TW TW110122165A patent/TWI816141B/en active
Non-Patent Citations (1)
Title |
---|
Online document, Taiwan Depository & Clearing Corporation, Anti-Money Laundering Query System, October 2017, https://smart.tdcc.com.tw/attach/etraining/T_299.pdf * |