TWI700664B - Text processing method and system - Google Patents
- Publication number
- TWI700664B
- Authority
- TW
- Taiwan
- Prior art keywords
- text
- target
- analyzed
- processing module
- vector group
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A text processing system includes a storage module and a processing module. The storage module stores a plurality of URLs and a plurality of keywords. For each URL, the processing module retrieves the text to be analyzed that corresponds to the URL. For each text to be analyzed, the processing module applies a word segmentation algorithm to obtain a plurality of segmented words for that text. Based on the segmented words of each text to be analyzed and on the keywords, the processing module obtains, from the texts to be analyzed, a plurality of target texts and at least one target segmented word for each target text. Based on the at least one target segmented word of each target text, the processing module applies a clustering algorithm to group the target texts into at least one cluster.
Description
The present invention relates to a text processing system based on natural language processing technology, and in particular to a text processing system applied to anti-money laundering.
Laws, regulations, and operating standards for anti-money laundering and combating the financing of terrorism (AML/CFT) are becoming increasingly stringent. Customer identification and customer due diligence are becoming increasingly complex, and the manpower devoted to AML operations has grown substantially as a result.
Under current AML name screening practice, when a name hits the negative news list, the operator must review every news item one by one and read it word for word to decide whether it is a true alert or a false alert. The operator must also determine whether the subject of the news event and the customer are the same person, which requires consulting data scattered across different internal systems and websites. Searching each system for transaction data and collecting and consolidating information on the customer and related parties is time-consuming and labor-intensive, so manual name screening is slow and prone to operational errors. With the rapid growth of financial services, the continuing elaboration of suspected money laundering and terrorist financing transaction patterns, and the ongoing development of AML system alert mechanisms, the number of name screening cases that hit negative news has also risen sharply, overloading operators.
Therefore, to ease the staffing burden and reduce misjudgments, natural language analysis is applied to AML-related texts to improve case review efficiency, strengthen the quality and consistency of negative news case reviews, reduce manpower requirements, and lower compliance costs.
Accordingly, an object of the present invention is to provide a text processing method that uses natural language analysis.
The text processing method of the present invention is implemented by an electronic device that stores a plurality of URLs, each linking to a text to be analyzed, and a plurality of keywords. The method includes a step (A), a step (B), a step (C), and a step (D).
In step (A), for each URL, the electronic device retrieves the text to be analyzed that corresponds to the URL.
In step (B), for each text to be analyzed, the electronic device applies a word segmentation algorithm to the text to obtain a plurality of segmented words for that text.
In step (C), the electronic device uses the segmented words of each text to be analyzed, together with the keywords, to obtain from the texts to be analyzed a plurality of target texts and, for each target text, at least one target segmented word.
In step (D), the electronic device uses the at least one target segmented word of each target text and a clustering algorithm to group the target texts into at least one cluster.
Another object of the present invention is to provide a text processing system that uses natural language analysis.
Accordingly, the text processing system of the present invention includes a storage module and a processing module electrically connected to the storage module.
The storage module stores a plurality of URLs, each linking to a text to be analyzed, and a plurality of keywords.
For each URL, the processing module retrieves the text to be analyzed that corresponds to the URL. For each text to be analyzed, the processing module applies a word segmentation algorithm to the text to obtain a plurality of segmented words for that text. Based on the segmented words of each text to be analyzed and on the keywords, the processing module obtains, from the texts to be analyzed, a plurality of target texts and at least one target segmented word for each target text. Based on the at least one target segmented word of each target text, the processing module applies a clustering algorithm to group the target texts into at least one cluster.
The effect of the present invention is as follows. The processing module obtains, from the texts to be analyzed, a plurality of target texts and the at least one target segmented word of each, and uses the clustering algorithm to group the target texts into the at least one cluster. As a result, during item-by-item review in a screening operation, reviewing any one target text in each cluster is sufficient to achieve the same result as the conventional practice. This greatly improves case review efficiency, strengthens the quality and consistency of negative news case reviews, and reduces manpower requirements and compliance costs.
Before the present invention is described in detail, note that in the following description similar elements are denoted by the same reference numerals.
Referring to FIG. 1, the text processing system of the present invention is a text processing system applied to anti-money laundering, and an embodiment of it includes an electronic device 1. The electronic device 1 includes a storage module 11, a display module 12, and a processing module 13 electrically connected to the storage module 11 and the display module 12.
The storage module 11 stores a plurality of URLs, each linking to a text to be analyzed, and a plurality of keywords. In this embodiment, the keywords are money laundering keywords, that is, keywords related to the field of money laundering.
In this embodiment, the electronic device 1 is implemented as, for example, a personal computer, a server, or a cloud host, but is not limited to these.
Referring to FIG. 2, the operation of the storage module 11, the display module 12, and the processing module 13 of the electronic device 1 is described below by way of a text processing method for anti-money laundering that the text processing system executes. The text processing method includes a step 51, a step 52, a step 53, and a step 54.
In step 51, for each URL, the processing module 13 retrieves the text to be analyzed that corresponds to the URL.
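The source does not say how the text is retrieved from a URL. As a minimal sketch, assuming each URL points to an HTML page and that the visible page text is what should be analyzed, step 51 could look like the following. The names `html_to_text` and `fetch_text` are hypothetical and are not taken from the patent.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Reduce an HTML document to its visible text (assumed preprocessing)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Step 51 (sketch): download the page behind one stored URL."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return html_to_text(html)
```

Keeping the HTML clean-up separate from the download makes the text extraction testable without network access. A production system would likely also need retries and site-specific extraction rules, which are outside what the source describes.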
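In step 52, for each text to be analyzed, the processing module 13 applies a word segmentation algorithm to the text to obtain a plurality of segmented words for that text. In this embodiment, the word segmentation algorithm is the known technique published by Ma, Wei-Yun and Chen, Keh-Jiann in 2003.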
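To illustrate what step 52 produces, the sketch below uses forward maximum matching against a small dictionary. This is a deliberately simple stand-in and is not the Ma and Chen (2003) algorithm the embodiment cites. The function name `segment`, the `dictionary` argument, and the `max_len` limit are assumptions made for illustration.

```python
def segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position take the longest
    dictionary word; fall back to a single character if none matches."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


words = segment("銀行涉嫌洗錢", {"銀行", "涉嫌", "洗錢"})
```

Here `words` is the list of segmented words for one text. Dictionary-based matching cannot handle unknown words well, which is one reason a real system would use a dedicated segmenter.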
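In step 53, the processing module 13 uses the segmented words of each text to be analyzed, together with the money laundering keywords, to obtain from the texts to be analyzed a plurality of target texts and, for each target text, at least one target segmented word.
Referring to FIG. 3, step 53 further includes a sub-step 531 and a sub-step 532.
In sub-step 531, for each text to be analyzed, the processing module 13 determines whether the segmented words of the text include at least one target segmented word that matches any of the money laundering keywords. When the processing module 13 determines that the text has at least one such target segmented word, the flow proceeds to sub-step 532; when it determines that the text has none, the text processing method for anti-money laundering ends.
In sub-step 532, for each such text to be analyzed, the processing module 13 takes the text as a target text and obtains its at least one target segmented word.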
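Sub-steps 531 and 532 amount to a keyword filter over the segmented words. A minimal sketch follows, assuming exact string matching between segmented words and keywords and assuming that a text with no match is simply dropped. The function name and the dictionary-based data layout are hypothetical.

```python
def select_target_texts(segmented_texts, keywords):
    """segmented_texts maps a text id to its list of segmented words.
    Returns a mapping from target text id to its target segmented words."""
    keyword_set = set(keywords)
    targets = {}
    for text_id, words in segmented_texts.items():
        # Sub-step 531: collect matching words, keeping first-seen order
        # and dropping repeats.
        matches = list(dict.fromkeys(w for w in words if w in keyword_set))
        if matches:
            # Sub-step 532: the text becomes a target text.
            targets[text_id] = matches
    return targets


docs = {
    "a": ["銀行", "涉嫌", "洗錢", "洗錢"],
    "b": ["天氣", "晴朗"],
    "c": ["詐欺", "集團", "洗錢"],
}
targets = select_target_texts(docs, ["洗錢", "詐欺"])
```

Text "b" contains no keyword and is therefore not a target text.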
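In step 54, the processing module 13 uses the at least one target segmented word of each target text and a clustering algorithm to group the target texts into at least one cluster, which is displayed on the display module 12.
Referring to FIG. 4, step 54 further includes a sub-step 541 and a sub-step 542.
In sub-step 541, for each target text, the processing module 13 uses the at least one target segmented word of the target text and a text embedding model, which converts text into numerical vectors, to obtain a text vector group for the target text. In this embodiment, the text embedding model is PV-DBOW (Paragraph Vector - Distributed Bag of Words), published by Le and Mikolov in 2014, but is not limited to it.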
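PV-DBOW is a learned model and needs a training corpus, so it is not reproduced here. Purely to show the shape of the data that sub-step 541 hands to the clustering step, the sketch below substitutes a much simpler embedding: a length-normalized count vector over a fixed keyword vocabulary. The function `embed` and the `vocabulary` argument are assumptions, and this stand-in does not capture the semantic similarity a trained PV-DBOW model would.

```python
import math


def embed(target_words, vocabulary):
    """Map the target segmented words of one text to a unit-length
    count vector with one dimension per vocabulary entry."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vector = [0.0] * len(vocabulary)
    for word in target_words:
        if word in index:
            vector[index[word]] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    # An all-zero vector is returned unchanged to avoid dividing by zero.
    return [v / norm for v in vector] if norm else vector


vocab = ["洗錢", "詐欺", "賄賂"]
vec = embed(["洗錢", "洗錢", "詐欺"], vocab)
```

Whatever model is used, the important property for the next sub-step is that every target text ends up as a vector of the same dimension.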
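In sub-step 542, the processing module 13 uses the text vector group of each target text and the clustering algorithm to group the target texts into the at least one cluster, which is displayed on the display module 12. Each cluster is a tree represented by a tree structure. In this embodiment, the clustering algorithm is BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), published by Zhang et al. in 1996, but is not limited to it.
Referring to FIG. 5, sub-step 542 further includes a sub-step 542A, a sub-step 542B, a sub-step 542C, a sub-step 542D, a sub-step 542E, and a sub-step 542F.
In sub-step 542A, the processing module 13 classifies a text vector group to be clustered into a candidate cluster, the text vector group to be clustered being one of the text vector groups.
In sub-step 542B, the processing module 13 determines whether the next text vector group to be clustered belongs to one of the currently existing candidate clusters, the next text vector group to be clustered being one of the text vector groups that have not yet been classified. When the processing module 13 determines that the next text vector group belongs to one of the currently existing candidate clusters, the flow proceeds to sub-step 542C; when it determines that the next text vector group does not belong to any candidate cluster, the flow proceeds to sub-step 542D. Specifically, the processing module 13 makes this determination by checking whether, after the next text vector group is added to a given existing candidate cluster, the overall distance of that candidate cluster in the vector space exceeds a preset threshold.
In sub-step 542C, the processing module 13 classifies the next text vector group to be clustered into that existing candidate cluster.
In sub-step 542D, the processing module 13 classifies the next text vector group to be clustered into a new candidate cluster.
In sub-step 542E, the processing module 13 determines whether any text vector group has not yet been classified. When the processing module 13 determines that unclassified text vector groups remain, the flow returns to sub-step 542B; when it determines that none remain, the flow proceeds to sub-step 542F.
In sub-step 542F, the processing module 13 takes the currently existing candidate clusters as the at least one cluster and displays them on the display module 12.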
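The loop in sub-steps 542A to 542F can be sketched as a single pass over the vectors. The source speaks only of the cluster's "overall distance" after the addition; the sketch assumes this means the cluster radius (root-mean-square distance from the centroid), as in BIRCH's threshold test, and it keeps flat lists rather than building the tree structure. When several clusters pass the test, it picks the one with the smallest resulting radius, which is also an assumption.

```python
import math


def radius(vectors):
    """Root-mean-square distance of the vectors from their centroid."""
    n = len(vectors)
    dim = len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / n for d in range(dim)]
    total = sum(sum((v[d] - centroid[d]) ** 2 for d in range(dim)) for v in vectors)
    return math.sqrt(total / n)


def cluster_incrementally(vectors, threshold):
    """Return clusters as lists of indices into `vectors`."""
    clusters = []
    for i, vector in enumerate(vectors):  # 542E: loop until none remain
        best, best_radius = None, None
        for cluster in clusters:
            # 542B: would this cluster stay within the threshold?
            r = radius([vectors[j] for j in cluster] + [vector])
            if r <= threshold and (best is None or r < best_radius):
                best, best_radius = cluster, r
        if best is not None:
            best.append(i)        # 542C: join an existing candidate cluster
        else:
            clusters.append([i])  # 542A / 542D: start a new candidate cluster
    return clusters               # 542F: candidates become the final clusters


points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 0.1]]
groups = cluster_incrementally(points, threshold=0.5)
```

Because vectors are processed in order and never reassigned, the result depends on the input order and on the threshold, which is typical of single-pass threshold clustering.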
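In summary, in the text processing system for anti-money laundering of the present invention, the processing module 13 selects, from the texts to be analyzed, a plurality of target texts related to money laundering and the at least one target segmented word of each. The text embedding model then converts the at least one target segmented word of each target text into the text vector group of that target text, and the clustering algorithm groups the target texts into the at least one cluster, each represented by a tree structure. As a result, during item-by-item review in a screening operation, only the target text represented by the root of each cluster's tree needs to be reviewed to achieve the same result as the conventional practice. This greatly improves case review efficiency, strengthens the quality and consistency of negative news case reviews, and reduces manpower requirements and compliance costs. The object of the present invention is therefore achieved.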
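The source does not specify how the text at the root of each cluster's tree is chosen. As one possible illustration only, the sketch below assumes flat clusters (lists of indices) and picks, for each cluster, the member whose vector lies closest to the cluster centroid. The function name `pick_representatives` is hypothetical.

```python
def pick_representatives(clusters, vectors):
    """For each cluster (a list of indices into `vectors`), return the
    index of the member nearest to the cluster centroid."""
    representatives = []
    for cluster in clusters:
        dim = len(vectors[cluster[0]])
        centroid = [
            sum(vectors[i][d] for i in cluster) / len(cluster) for d in range(dim)
        ]

        def squared_distance(i):
            return sum((vectors[i][d] - centroid[d]) ** 2 for d in range(dim))

        representatives.append(min(cluster, key=squared_distance))
    return representatives


vectors = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
to_review = pick_representatives([[0, 1, 2], [3]], vectors)
```

In this example four target texts reduce to two texts for manual review, one per cluster.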
The foregoing is merely an embodiment of the present invention and does not limit the scope in which the invention may be practiced. All simple equivalent changes and modifications made according to the claims and the specification of the present invention remain within the scope covered by the patent.
1: Electronic device
11: Storage module
12: Display module
13: Processing module
51~54: Steps
531~532: Sub-steps
541~542: Sub-steps
542A~542F: Sub-steps
Other features and effects of the present invention will be clearly presented in the embodiments described with reference to the drawings, in which: FIG. 1 is a block diagram illustrating an embodiment of the text processing system of the present invention; FIG. 2 is a flowchart illustrating a text processing method executed by the embodiment; FIG. 3 is a flowchart illustrating the detailed flow by which the text processing method obtains a target text and its at least one target segmented word; FIG. 4 is a flowchart illustrating the detailed flow by which the text processing method obtains the text vector groups of all target texts and groups them into at least one cluster; and FIG. 5 is a flowchart illustrating the detailed flow by which the text processing method groups all target texts into at least one cluster.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW108121204A TWI700664B (en) | 2019-06-19 | 2019-06-19 | Text processing method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW108121204A TWI700664B (en) | 2019-06-19 | 2019-06-19 | Text processing method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
TWI700664B true TWI700664B (en) | 2020-08-01 |
TW202101363A TW202101363A (en) | 2021-01-01 |
Family
ID=73002939
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW108121204A TWI700664B (en) | 2019-06-19 | 2019-06-19 | Text processing method and system |
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI700664B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW200719172A (en) * | 2005-11-04 | 2007-05-16 | Webgenie Information Ltd | Method for automatically detecting similar documents |
CN101763404A (en) * | 2009-12-10 | 2010-06-30 | 陕西鼎泰科技发展有限责任公司 | Network text data detection method based on fuzzy cluster |
US20130144874A1 (en) * | 2010-11-05 | 2013-06-06 | Nextgen Datacom, Inc. | Method and system for document classification or search using discrete words |
CN103631809A (en) * | 2012-08-24 | 2014-03-12 | 宏碁股份有限公司 | Data clustering device and method |
TWM585945U (en) * | 2019-06-19 | 2019-11-01 | 中國信託商業銀行股份有限公司 | Text processing system |
- 2019-06-19 TW TW108121204A patent/TWI700664B/en active
Also Published As
Publication number | Publication date |
---|---|
TW202101363A (en) | 2021-01-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
NL2012438B1 (en) | Resolving similar entities from a database. | |
CN105915555B (en) | Method and system for detecting network abnormal behavior | |
US9317613B2 (en) | Large scale entity-specific resource classification | |
CN104112026B (en) | A kind of short message text sorting technique and system | |
WO2022134794A1 (en) | Method and apparatus for processing public opinions about news event, storage medium, and computer device | |
CN110851598B (en) | Text classification method and device, terminal equipment and storage medium | |
CN104077407B (en) | A kind of intelligent data search system and method | |
WO2023093100A1 (en) | Method and apparatus for identifying abnormal calling of api gateway, device, and product | |
WO2023284132A1 (en) | Method and system for analyzing cloud platform logs, device, and medium | |
CN110532352B (en) | Text duplication checking method and device, computer readable storage medium and electronic equipment | |
CN105975547B (en) | Based on content web document detection method approximate with position feature | |
WO2012083874A1 (en) | Webpage information detection method and system | |
CN103455758A (en) | Method and device for identifying malicious website | |
CN117971606B (en) | Log management system and method based on elastic search | |
CN109271614A (en) | A kind of data duplicate checking method | |
CN111639077A (en) | Data management method and device, electronic equipment and storage medium | |
CN114116811B (en) | Log processing method, device, equipment and storage medium | |
CN109471934B (en) | Financial risk clue mining method based on Internet | |
CN116841779A (en) | Abnormality log detection method, abnormality log detection device, electronic device and readable storage medium | |
CN113806492A (en) | Record generation method, device and equipment based on semantic recognition and storage medium | |
TWI700664B (en) | Text processing method and system | |
CN109409091B (en) | Method, device and equipment for detecting Web page and computer storage medium | |
CN112308251A (en) | Work order assignment method and system based on machine learning | |
TWM585945U (en) | Text processing system | |
CN111738290A (en) | Image detection method, model construction and training method, device, equipment and medium |