TWI700664B - Text processing method and system - Google Patents

Text processing method and system

Info

Publication number
TWI700664B
Authority
TW
Taiwan
Prior art keywords
text
target
analyzed
processing module
vector group
Prior art date
Application number
TW108121204A
Other languages
Chinese (zh)
Other versions
TW202101363A (en)
Inventor
林淑芬
宋政隆
田文
陳皓遠
陳逸航
Original Assignee
中國信託商業銀行股份有限公司
Priority date
Filing date
Publication date
Application filed by 中國信託商業銀行股份有限公司
Priority to TW108121204A
Application granted
Publication of TWI700664B
Publication of TW202101363A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text processing system includes a storage module and a processing module. The storage module stores a plurality of link URLs and a plurality of keywords. For each link URL, the processing module obtains the text to be analyzed that corresponds to that link URL. For each text to be analyzed, the processing module uses a word segmentation algorithm to obtain a plurality of segmented words corresponding to that text. Based on the segmented words of each text to be analyzed and on the keywords, the processing module obtains, from the texts to be analyzed, a plurality of target texts and at least one target segmented word corresponding to each target text. Based on the at least one target segmented word of each target text, the processing module uses a clustering algorithm to divide the target texts into at least one cluster.

Description

Text processing method and system

The present invention relates to a text processing system based on natural language processing technology, and in particular to a text processing system applied to the field of anti-money laundering.

Laws, regulations, and operating standards for anti-money laundering and combating the financing of terrorism (AML/CFT) are becoming increasingly stringent. Customer identification and customer due diligence are becoming increasingly complicated, and the manpower devoted to AML operations has grown substantially as a result.

Under current AML name screening, when a name hits the negative news list, the operator must review every news item one by one and read it word by word to decide whether it is a true alert or a false alert. The operator must also determine whether the subject of the news event and the customer in question are the same person, which requires consulting data scattered across different internal systems and websites. Looking up transaction data across systems, and collecting and consolidating information on the customer and related parties, is time-consuming and labor-intensive. This makes name screening slow and prone to operational error. As financial services grow rapidly, typologies of suspected money laundering and terrorist financing transactions continue to be refined, and the alert functions of AML systems continue to be developed, the number of name screening cases that hit negative news has increased sharply, overloading operators.

Therefore, to ease the staffing burden and reduce misjudgments, natural language analysis is applied to AML-related texts in order to improve case review efficiency, strengthen the quality and consistency of negative news case review, reduce manpower requirements, and lower compliance costs.

Accordingly, an object of the present invention is to provide a text processing method that uses natural language analysis.

The text processing method of the present invention is implemented by an electronic device that stores a plurality of link URLs for linking to a plurality of texts to be analyzed, and a plurality of keywords. The text processing method includes a step (A), a step (B), a step (C), and a step (D).

In step (A), for each link URL, the electronic device obtains, according to the link URL, the text to be analyzed that corresponds to that link URL.

In step (B), for each text to be analyzed, the electronic device uses a word segmentation algorithm on the text to obtain a plurality of segmented words corresponding to that text.

In step (C), based on the segmented words of each text to be analyzed and on the keywords, the electronic device obtains, from the texts to be analyzed, a plurality of target texts and at least one target segmented word corresponding to each target text.

In step (D), based on the at least one target segmented word of each target text, the electronic device uses a clustering algorithm to divide the target texts into at least one cluster.

Another object of the present invention is to provide a text processing system that uses natural language analysis.

The text processing system of the present invention includes a storage module and a processing module electrically connected to the storage module.

The storage module stores a plurality of link URLs for linking to a plurality of texts to be analyzed, and a plurality of keywords.

For each link URL, the processing module obtains, according to the link URL, the text to be analyzed that corresponds to that link URL. For each text to be analyzed, the processing module uses a word segmentation algorithm on the text to obtain a plurality of segmented words corresponding to that text. Based on the segmented words of each text to be analyzed and on the keywords, the processing module obtains, from the texts to be analyzed, a plurality of target texts and at least one target segmented word corresponding to each target text. Based on the at least one target segmented word of each target text, the processing module uses a clustering algorithm to divide the target texts into at least one cluster.

The effect of the present invention is as follows. The processing module obtains, from the texts to be analyzed, a plurality of target texts and the corresponding at least one target segmented word, and uses the clustering algorithm to divide the target texts into the at least one cluster. As a result, during item-by-item review in a screening operation, reviewing any one target text in each cluster achieves the same effect as the conventional practice. This greatly improves case review efficiency, strengthens the quality and consistency of negative news case review, reduces manpower requirements, and lowers compliance costs.

Before the present invention is described in detail, it should be noted that in the following description, similar elements are denoted by the same reference numerals.

Referring to FIG. 1, the text processing system of the present invention is a text processing system applied to anti-money laundering, and an embodiment thereof includes an electronic device 1. The electronic device 1 includes a storage module 11, a display module 12, and a processing module 13 electrically connected to the storage module 11 and the display module 12. In this embodiment, the system is applied in particular to anti-money laundering.

The storage module 11 stores a plurality of link URLs for linking to a plurality of texts to be analyzed, and a plurality of keywords. In this embodiment, the keywords are a plurality of money laundering keywords related to the money laundering field.

In this embodiment, the electronic device 1 is implemented as, for example, a personal computer, a server, or a cloud host, but is not limited thereto.

Referring to FIG. 2, the operation of the storage module 11, the display module 12, and the processing module 13 of the electronic device 1 is described below by way of a text processing method applied to anti-money laundering, which is executed by the text processing system of the present invention. The text processing method includes a step 51, a step 52, a step 53, and a step 54.

In step 51, for each link URL, the processing module 13 obtains, according to the link URL, the text to be analyzed that corresponds to that link URL.
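Step 51 can be sketched as follows. This is only a minimal illustration under stated assumptions: the patent does not say how the text is extracted from the linked page, so the helper names `html_to_text` and `fetch_texts`, the decision to skip `<script>` and `<style>` content, and the injectable `fetch` parameter are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Hypothetical helper: reduce an HTML page to its visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def fetch_texts(urls, fetch=None):
    """Return {url: text to be analyzed}.

    `fetch` maps a URL to an HTML string; by default it downloads the page.
    Passing a custom `fetch` is an assumption made here to ease testing.
    """
    if fetch is None:
        def fetch(url):
            with urlopen(url, timeout=10) as resp:
                charset = resp.headers.get_content_charset() or "utf-8"
                return resp.read().decode(charset, errors="replace")
    return {url: html_to_text(fetch(url)) for url in urls}
```

Calling `fetch_texts(stored_urls)` would then yield one text to be analyzed per stored link URL, which is the input that step 52 expects.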

In step 52, for each text to be analyzed, the processing module 13 uses a word segmentation algorithm on the text to obtain a plurality of segmented words corresponding to that text. It is worth noting that, in this embodiment, the word segmentation algorithm is the known technique published by Ma, Wei-Yun and Chen, Keh-Jiann in 2003.
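To make the idea of word segmentation concrete, the sketch below uses greedy forward maximum matching against a small lexicon. This is a deliberately simple stand-in and is not the Ma and Chen (2003) algorithm named in the embodiment; the function name `segment`, the `lexicon` set, and the `max_len` parameter are hypothetical.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching (a stand-in, not the cited algorithm).

    At each position, take the longest lexicon entry of up to `max_len`
    characters; characters not covered by the lexicon become single tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Example with a toy lexicon of three words.
lexicon = {"銀行", "洗錢", "防制"}
print(segment("銀行洗錢防制", lexicon))  # ['銀行', '洗錢', '防制']
```

A production system would more likely call an established segmenter, since dictionary matching alone cannot resolve ambiguous boundaries or unknown words.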

In step 53, based on the segmented words of each text to be analyzed and on the money laundering keywords, the processing module 13 obtains, from the texts to be analyzed, a plurality of target texts and at least one target segmented word corresponding to each target text.

Referring to FIG. 3, it is worth noting that step 53 further includes a sub-step 531 and a sub-step 532.

In sub-step 531, for each text to be analyzed, the processing module 13 determines whether the segmented words of that text include at least one target segmented word that matches any one of the money laundering keywords. When the processing module 13 determines that the text to be analyzed has the corresponding at least one target segmented word, the flow proceeds to sub-step 532. When the processing module 13 determines that the text to be analyzed has no corresponding target segmented word, the text processing method applied to anti-money laundering ends.

In sub-step 532, for each text to be analyzed, the processing module 13 takes the text to be analyzed as a target text and obtains the at least one target segmented word corresponding to it.
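Sub-steps 531 and 532 amount to a keyword filter over the segmented words. The following is a minimal sketch assuming exact string matching between segmented words and keywords; the function name `select_target_texts` and the dictionary layout are hypothetical, and treating a text without any match as simply skipped is one reading of sub-step 531.

```python
def select_target_texts(segmented, keywords):
    """Return {text_id: [target segmented words]} for texts with a match.

    `segmented` maps a text identifier to its list of segmented words.
    A text becomes a target text when at least one of its segmented words
    equals one of the keywords (sub-steps 531 and 532).
    """
    keyword_set = set(keywords)
    targets = {}
    for text_id, tokens in segmented.items():
        # Keep matches in order of first appearance, without duplicates.
        matched = list(dict.fromkeys(t for t in tokens if t in keyword_set))
        if matched:
            targets[text_id] = matched
    return targets


segmented = {
    "news-1": ["銀行", "洗錢", "洗錢", "詐欺"],
    "news-2": ["天氣", "晴朗"],
}
print(select_target_texts(segmented, ["洗錢", "詐欺"]))
# {'news-1': ['洗錢', '詐欺']}
```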

In step 54, based on the at least one target segmented word of each target text, the processing module 13 uses a clustering algorithm to divide the target texts into at least one cluster and displays the result on the display module 12.

Referring to FIG. 4, it is worth noting that step 54 further includes a sub-step 541 and a sub-step 542.

In sub-step 541, for each target text, the processing module 13 applies a text embedding model, which converts text into numerical vectors, to the at least one target segmented word of that target text, and thereby obtains a text vector group corresponding to the target text. It is worth noting that, in this embodiment, the text embedding model is PV-DBOW (Paragraph Vector - Distributed Bag of Words), published by Le and Mikolov in 2014, but is not limited thereto.
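The sketch below shows only the shape of sub-step 541: a list of target segmented words goes in and a fixed-length numerical vector comes out. It uses a hashed bag-of-words vector as a stand-in and is not PV-DBOW, which is a trained model; the function name `embed_tokens` and the `dim` parameter are hypothetical. A trained PV-DBOW model could be obtained from a library instead, for example gensim's `Doc2Vec` with `dm=0`.

```python
import hashlib
import math


def embed_tokens(tokens, dim=16):
    """Map segmented words to a unit-length vector (stand-in, not PV-DBOW).

    Each word is hashed to one of `dim` buckets and counted; the count
    vector is then L2-normalized. An empty input gives the zero vector.
    """
    vec = [0.0] * dim
    for tok in tokens:
        # hashlib gives a hash that is stable across runs, unlike hash().
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec
```

Unlike a learned embedding, this stand-in places unrelated words in the same bucket by chance and carries no notion of meaning, so it only serves to feed the clustering step with vectors of a fixed size.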

In sub-step 542, based on the text vector group of each target text, the processing module 13 uses the clustering algorithm to divide the target texts into the at least one cluster and displays the result on the display module 12. Each cluster is a tree represented by a tree structure. It is worth noting that, in this embodiment, the clustering algorithm is BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), published by Zhang et al. in 1996, but is not limited thereto.

Referring to FIG. 5, it is worth noting that sub-step 542 further includes a sub-step 542A, a sub-step 542B, a sub-step 542C, a sub-step 542D, a sub-step 542E, and a sub-step 542F.

In sub-step 542A, the processing module 13 classifies a text vector group to be grouped into a candidate cluster, the text vector group to be grouped being one of the text vector groups.

In sub-step 542B, the processing module 13 determines whether the next text vector group to be grouped belongs to one of the currently existing candidate clusters, the next text vector group to be grouped being one of the text vector groups that have not yet been classified. When the processing module 13 determines that the next text vector group to be grouped belongs to one of the currently existing candidate clusters, the flow proceeds to sub-step 542C. When the processing module 13 determines that the next text vector group to be grouped does not belong to any candidate cluster, the flow proceeds to sub-step 542D. In particular, the processing module 13 makes this determination by checking whether, after the next text vector group to be grouped is added to a given currently existing candidate cluster, the overall distance of that candidate cluster in the vector space exceeds a preset threshold.
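The patent does not define the "overall distance" of a cluster. One common choice in BIRCH, given here as an assumption rather than as the embodiment's definition, is the cluster radius computed from the clustering feature $(N, LS, SS)$, where $N$ is the number of vectors, $LS = \sum_i x_i$ is their linear sum, and $SS = \sum_i \lVert x_i \rVert^2$ is their squared sum:

```latex
\[
c = \frac{LS}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
R = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \lVert x_i - c \rVert^2}
  = \sqrt{\frac{SS}{N} - \left\lVert \frac{LS}{N} \right\rVert^2}.
\]
```

Under this reading, a new vector $x$ joins a candidate cluster when the radius computed from the updated feature $(N + 1,\ LS + x,\ SS + \lVert x \rVert^2)$ does not exceed the preset threshold $T$. Because the feature is additive, the test needs no access to the individual vectors already in the cluster.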

In sub-step 542C, the processing module 13 classifies the next text vector group to be grouped into that one of the currently existing candidate clusters.

In sub-step 542D, the processing module 13 classifies the next text vector group to be grouped into another, new candidate cluster.

In sub-step 542E, the processing module 13 determines whether any text vector group has not yet been classified. When the processing module 13 determines that there is still a text vector group that has not yet been classified, the flow returns to sub-step 542B. When the processing module 13 determines that no unclassified text vector group remains, the flow proceeds to sub-step 542F.

In sub-step 542F, the processing module 13 takes the currently existing candidate clusters as the at least one cluster and displays them on the display module 12.
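The loop formed by sub-steps 542A to 542F can be sketched as a single pass over the vectors. This is a simplified, flat illustration and not a full BIRCH implementation, which maintains a height-balanced tree of clustering features. It assumes that the "overall distance" is the cluster radius (root-mean-square distance to the centroid) and that, when several clusters pass the threshold, the one with the smallest resulting radius is chosen; both choices, and the name `cluster_incrementally`, are hypothetical.

```python
import math


def _radius(n, ls, ss):
    """RMS distance of n points to their centroid, from the sums LS and SS."""
    centroid_sq = sum((s / n) ** 2 for s in ls)
    return math.sqrt(max(ss / n - centroid_sq, 0.0))


def cluster_incrementally(vectors, threshold):
    """Return clusters as lists of indices into `vectors`."""
    clusters = []  # each: {"n": count, "ls": linear sum, "ss": squared sum}
    for idx, vec in enumerate(vectors):
        sq = sum(v * v for v in vec)
        best, best_radius = None, None
        # 542B: test the vector against every existing candidate cluster.
        for c in clusters:
            ls = [a + b for a, b in zip(c["ls"], vec)]
            r = _radius(c["n"] + 1, ls, c["ss"] + sq)
            if r <= threshold and (best_radius is None or r < best_radius):
                best, best_radius = c, r
        if best is None:
            # 542A / 542D: open a new candidate cluster.
            clusters.append({"n": 1, "ls": list(vec), "ss": sq,
                             "members": [idx]})
        else:
            # 542C: add the vector to the chosen candidate cluster.
            best["n"] += 1
            best["ls"] = [a + b for a, b in zip(best["ls"], vec)]
            best["ss"] += sq
            best["members"].append(idx)
    # 542E / 542F: all vectors are classified; the candidates are final.
    return [c["members"] for c in clusters]


points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1], [0.0, 0.1]]
print(cluster_incrementally(points, threshold=0.5))  # [[0, 1, 4], [2, 3]]
```

The result depends on the order in which the vectors arrive and on the threshold, so both would need tuning on real news data.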

In summary, in the text processing system applied to anti-money laundering of the present invention, the processing module 13 filters out, from the texts to be analyzed, a plurality of target texts related to money laundering and the corresponding at least one target segmented word. It then uses the text embedding model to convert the at least one target segmented word of each target text into the text vector group corresponding to that target text, and uses the clustering algorithm to divide the target texts into the at least one cluster, each represented by a tree structure. As a result, during item-by-item review in a screening operation, only the target text represented by the root of each cluster's tree needs to be reviewed to achieve the same effect as the conventional practice. This greatly improves case review efficiency, strengthens the quality and consistency of negative news case review, reduces manpower requirements, and lowers compliance costs. The object of the present invention is therefore achieved.

The foregoing is merely an embodiment of the present invention and shall not limit the scope of implementation of the present invention. All simple equivalent changes and modifications made according to the claims and the specification of the present invention remain within the scope covered by the patent of the present invention.

1: Electronic device

11: Storage module

12: Display module

13: Processing module

51~54: Steps

531~532: Sub-steps

541~542: Sub-steps

542A~542F: Sub-steps

Other features and effects of the present invention will be clearly presented in the embodiments described with reference to the drawings, in which:
FIG. 1 is a block diagram illustrating an embodiment of the text processing system of the present invention;
FIG. 2 is a flowchart illustrating a text processing method executed by the embodiment;
FIG. 3 is a flowchart illustrating the detailed flow by which the text processing method obtains a target text and its corresponding at least one target segmented word;
FIG. 4 is a flowchart illustrating the detailed flow by which the text processing method obtains the text vector groups of all target texts and divides the target texts into at least one cluster; and
FIG. 5 is a flowchart illustrating the detailed flow by which the text processing method divides all target texts into at least one cluster.

Claims (8)

1. A text processing method implemented by an electronic device, the electronic device storing a plurality of link URLs for linking to a plurality of texts to be analyzed, and a plurality of keywords, the text processing method comprising the following steps: (A) for each link URL, obtaining, by the electronic device and according to the link URL, the text to be analyzed that corresponds to the link URL; (B) for each text to be analyzed, obtaining, by the electronic device and according to the text to be analyzed, a plurality of segmented words corresponding to the text to be analyzed by using a word segmentation algorithm; (C) obtaining, by the electronic device and according to the segmented words of each text to be analyzed and the keywords, a plurality of target texts and at least one target segmented word corresponding to each target text from the texts to be analyzed, the at least one target segmented word of each target text matching any one of the keywords; and (D) dividing, by the electronic device and according to the at least one target segmented word of each target text, the target texts into at least one cluster by using a clustering algorithm, wherein step (D) comprises the following steps: (D-1) for each target text, obtaining, by the electronic device and according to the at least one target segmented word of the target text, a text vector group corresponding to the target text by using a text embedding model for converting text into numerical vectors, and (D-2) dividing, by the electronic device and according to the text vector group of each target text, the target texts into the at least one cluster by using the clustering algorithm.

2. The text processing method of claim 1, wherein step (C) comprises the following steps: (C-1) for each text to be analyzed, determining, by the electronic device, whether the segmented words of the text to be analyzed include at least one target segmented word that matches any one of the keywords; and (C-2) for each text to be analyzed, when the electronic device determines that the text to be analyzed has the corresponding at least one target segmented word, taking, by the electronic device, the text to be analyzed as the target text and obtaining the at least one target segmented word corresponding to it.

3. The text processing method of claim 1, wherein: (D-2-1) the electronic device classifies a text vector group to be grouped into a candidate cluster, the text vector group to be grouped being one of the text vector groups; (D-2-2) the electronic device determines whether the next text vector group to be grouped belongs to one of the currently existing candidate clusters, the next text vector group to be grouped being one of the text vector groups that have not yet been classified; (D-2-3) when it is determined that the next text vector group to be grouped belongs to one of the currently existing candidate clusters, the electronic device classifies the next text vector group to be grouped into that candidate cluster; (D-2-4) when it is determined that the next text vector group to be grouped does not belong to any candidate cluster, the electronic device classifies the next text vector group to be grouped into another, new candidate cluster; and (D-2-5) the electronic device repeats step (D-2-2) until all text vector groups that have not yet been classified have been classified, the currently existing candidate clusters then being the at least one cluster.

4. The text processing method of claim 1, wherein, in step (D-2), the clustering algorithm is Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), and each cluster is a tree structure.

5. A text processing system, comprising: a storage module storing a plurality of link URLs for linking to a plurality of texts to be analyzed, and a plurality of keywords; and a processing module electrically connected to the storage module; wherein, for each link URL, the processing module obtains, according to the link URL, the text to be analyzed that corresponds to the link URL; for each text to be analyzed, the processing module obtains, according to the text to be analyzed, a plurality of segmented words corresponding to the text to be analyzed by using a word segmentation algorithm; the processing module obtains, according to the segmented words of each text to be analyzed and the keywords, a plurality of target texts and at least one target segmented word corresponding to each target text from the texts to be analyzed, the at least one target segmented word of each target text matching any one of the keywords; the processing module divides, according to the at least one target segmented word of each target text, the target texts into at least one cluster by using a clustering algorithm; for each target text, the processing module obtains, according to the at least one target segmented word of the target text, a text vector group corresponding to the target text by using a text embedding model for converting text into numerical vectors; and the processing module divides, according to the text vector group of each target text, the target texts into the at least one cluster by using the clustering algorithm.

6. The text processing system of claim 5, wherein, for each text to be analyzed, the processing module determines whether the segmented words of the text to be analyzed include at least one target segmented word that matches any one of the keywords, and, for each text to be analyzed, when the processing module determines that the text to be analyzed has the corresponding at least one target segmented word, the processing module takes the text to be analyzed as the target text and obtains the at least one target segmented word corresponding to it.

7. The text processing system of claim 5, wherein the processing module classifies a text vector group to be grouped into a candidate cluster, the text vector group to be grouped being one of the text vector groups; the processing module determines whether the next text vector group to be grouped belongs to one of the currently existing candidate clusters, the next text vector group to be grouped being one of the text vector groups that have not yet been classified; when the processing module determines that the next text vector group to be grouped belongs to one of the currently existing candidate clusters, the processing module classifies the next text vector group to be grouped into that candidate cluster; when the processing module determines that the next text vector group to be grouped does not belong to any candidate cluster, the processing module classifies the next text vector group to be grouped into another, new candidate cluster; and the processing module repeatedly determines and classifies the next text vector group that has not yet been classified until all such text vector groups have been classified, the currently existing candidate clusters then being the at least one cluster.

8. The text processing system of claim 5, wherein the clustering algorithm is Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), and each cluster is a tree structure.
TW108121204A 2019-06-19 2019-06-19 Text processing method and system TWI700664B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW108121204A TWI700664B (en) 2019-06-19 2019-06-19 Text processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW108121204A TWI700664B (en) 2019-06-19 2019-06-19 Text processing method and system

Publications (2)

Publication Number Publication Date
TWI700664B true TWI700664B (en) 2020-08-01
TW202101363A TW202101363A (en) 2021-01-01

Family

ID=73002939

Family Applications (1)

Application Number Title Priority Date Filing Date
TW108121204A TWI700664B (en) 2019-06-19 2019-06-19 Text processing method and system

Country Status (1)

Country Link
TW (1) TWI700664B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200719172A (en) * 2005-11-04 2007-05-16 Webgenie Information Ltd Method for automatically detecting similar documents
CN101763404A (en) * 2009-12-10 2010-06-30 陕西鼎泰科技发展有限责任公司 Network text data detection method based on fuzzy cluster
US20130144874A1 (en) * 2010-11-05 2013-06-06 Nextgen Datacom, Inc. Method and system for document classification or search using discrete words
CN103631809A (en) * 2012-08-24 2014-03-12 宏碁股份有限公司 Data clustering device and method
TWM585945U (en) * 2019-06-19 2019-11-01 中國信託商業銀行股份有限公司 Text processing system

Also Published As

Publication number Publication date
TW202101363A (en) 2021-01-01
