TWI672597B

TWI672597B - Automatic text labeling method and system

Info

Publication number: TWI672597B
Application number: TW107142222A
Authority: TW
Inventors: 趙式隆; 林奕辰; 沈昇勳
Original assignee: 洽吧智能股份有限公司
Priority date: 2018-11-27
Filing date: 2018-11-27
Publication date: 2019-09-21
Also published as: TW202020690A

Abstract

The invention discloses an automatic text labeling method and system. The system comprises an artificial intelligence language processing module, an artificial intelligence text classification module, a label library, and an artificial intelligence label output module. The artificial intelligence language processing module receives an input text and converts the input text into quantized features of at least one specific dimension; the artificial intelligence text classification module converts the quantized features of the at least one specific dimension into a feature matrix of at least one specific confidence level; the label library stores at least one label to be output; and the artificial intelligence label output module converts the feature matrix of the at least one specific confidence level into at least one text label and a corresponding quantized confidence level.

Description

Automatic text labeling method and system

The invention relates to an automatic text labeling method and system, and in particular to an automatic text labeling method and system in which artificial intelligence parses the content of a text and assigns it possible labels, so that artificial intelligence can speed up human text classification or replace humans in completing it.

In today's era of data technology, text data of every kind emerges endlessly, such as doctors' diagnosis certificates, diverse news, and massive numbers of original self-media articles. Faced with such rich and varied information, people urgently need automated tools that help them find the key information they need, accurately and quickly, in a vast ocean of information. How to parse text content quickly and accurately and assign it possible labels, so that artificial intelligence can speed up human classification, is therefore one of the important topics worth consideration by those of ordinary skill in the art.

The invention provides an automatic text labeling method and system in which artificial intelligence parses the content of a text and assigns it possible labels, so that artificial intelligence can speed up human text classification or replace humans in completing it. The invention provides an automatic text labeling system. The system comprises an artificial intelligence language processing module, an artificial intelligence text classification module, a label library, and an artificial intelligence label output module. The artificial intelligence language processing module receives an input text and converts the input text into quantized features of at least one specific dimension; the artificial intelligence text classification module converts the quantized features of the at least one specific dimension into a feature matrix of at least one specific confidence level; the label library stores at least one label to be output; and the artificial intelligence label output module converts the feature matrix of the at least one specific confidence level into at least one text label and a corresponding quantized confidence level. The invention further provides an automatic text labeling method comprising the following steps. First, an input text is received and converted into quantized features of at least one specific dimension. Next, the quantized features of the at least one specific dimension are converted into a feature matrix of at least one specific confidence level. After that, at least one label to be output is stored. Then, the feature matrix of the at least one specific confidence level is converted into at least one text label and a corresponding quantized confidence level.

The invention is best understood by reference to the detailed description set forth herein and the accompanying drawings. Various embodiments are discussed below with reference to the drawings.
However, those skilled in the art will readily appreciate that the detailed description given here with respect to the drawings is for explanatory purposes only, because the methods and systems may extend beyond the described embodiments. For example, the teachings presented and the needs of a particular application may yield multiple alternative and suitable approaches to implementing the functionality of any detail described herein. Therefore, any approach may extend beyond the particular implementation choices in the embodiments described and shown below.

Certain terms are used throughout the specification and the following claims to refer to particular elements. Those of ordinary skill in the art will understand that hardware manufacturers may refer to the same element by different names. This specification and the following claims do not distinguish elements by differences in name, but by differences in function. The term "comprising" as used throughout the specification and the following claims is open-ended and should be interpreted as "including but not limited to". In addition, the term "coupled" herein includes any direct or indirect means of electrical connection. Therefore, if a first device is described as coupled to a second device, the first device may be directly electrically connected to the second device, or indirectly electrically connected to the second device through other devices or connection means.

Please refer to FIG. 1, which is a block diagram of an automatic text labeling system 100 of the present invention. As shown in FIG. 1, the automatic text labeling system 100 includes an artificial intelligence language processing module 110, an artificial intelligence text classification module 120, a label library 130, and an artificial intelligence label output module 140.
The artificial intelligence language processing module 110 receives an input text d and converts the input text d into quantized features of at least one specific dimension. That is, the artificial intelligence language processing module 110 quantifies the unstructured text d. The module 110 is, for example, electrically connected to a scanning device or a digital camera; after the scanning device or digital camera acquires a document image, optical character recognition (OCR) is used to digitize the text in the scanned or photographed document image. As for converting the input text d into quantized features of at least one specific dimension, the artificial intelligence language processing module 110 may, for example, use word embedding to convert the input text d into a feature matrix in the form of a two-dimensional matrix (TxW), where T is the length of the time axis, that is, the sequence length of the input text, and W is a self-defined parameter of the text classification core whose size is positively correlated with the complexity of the labeling system. Note that in this embodiment the TxW feature matrix can also be regarded as a one-dimensional matrix arranged in time order.

The artificial intelligence text classification module 120 converts the quantized features of the at least one specific dimension into a feature matrix of at least one specific confidence level. That is, the artificial intelligence text classification module 120 receives the digitized text feature matrix and outputs a feature matrix of the confidence levels corresponding to the given labels.
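The conversion of the input text d into a TxW feature matrix can be illustrated with a minimal sketch. It assumes the text has already been split into tokens and uses randomly initialized vectors in place of trained embeddings; the function names `build_embedding_table` and `embed`, the width W = 4, and the sample tokens are hypothetical and do not come from the specification.

```python
import random

def build_embedding_table(vocab, width, seed=0):
    """Assign each token a fixed random vector of length `width` (W).

    A real system would learn these vectors; random values are used
    here only to show the shape of the result.
    """
    rng = random.Random(seed)
    return {tok: [rng.uniform(-1.0, 1.0) for _ in range(width)]
            for tok in sorted(vocab)}

def embed(tokens, table, width):
    """Return a T x W matrix: one row per token, in sequence order."""
    unknown = [0.0] * width  # tokens missing from the table map to zeros
    return [list(table.get(tok, unknown)) for tok in tokens]

W = 4                                    # hypothetical embedding width
tokens = ["left", "foot", "contusion"]   # hypothetical tokenization of text d
table = build_embedding_table(set(tokens), W)
features = embed(tokens, table, W)
print(len(features), len(features[0]))   # T rows, W columns
```

Each row corresponds to one position on the time axis, which is why the same structure can also be read as a time-ordered sequence of W-dimensional vectors.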
After the artificial intelligence text classification module 120 receives the TxW feature matrix, it may, for example, process it with a neural network to convert the TxW feature matrix into a one-dimensional matrix whose size equals the vocabulary size of the label library. The neural network may be a recurrent neural network (RNN), a long short-term memory (LSTM) model, or a convolutional neural network (CNN). Note that this is only an embodiment of the invention, not a limitation of it.

As shown in FIG. 1, the label library 130 stores at least one label to be output; that is, the labels of the core terms in the text are stored in the label library 130. The artificial intelligence label output module 140 converts the feature matrix of the at least one specific confidence level into at least one text label and a corresponding quantized confidence level. That is, the artificial intelligence label output module 140 converts the numerical confidence-level feature matrix into the corresponding labels. For example, the invention defines a loss function, which may be, but is not limited to, cross-entropy, focal loss, or mean squared error (MSE), to calculate the gap between the confidence-level matrix output by the core and the ground truth; the larger the difference, the larger the value of the loss function. In addition, the artificial intelligence label output module 140 uses deep learning algorithms such as stochastic gradient descent (SGD), Adagrad, AdaDelta, Adam, or RMSProp to iteratively update the parameters of the core so that it predicts the confidence levels accurately. Note that these deep learning algorithms are only embodiments of the invention, not limitations of it.
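The mapping from a TxW feature matrix to one confidence level per label, together with a loss function and an iterative parameter update, can be sketched as follows. This is an assumption-laden illustration, not the classifier of the specification: in place of an RNN, LSTM, or CNN it substitutes a much simpler model (mean pooling over the time axis followed by a linear layer and a sigmoid), and it uses per-label cross-entropy with plain SGD. All function names, the learning rate, and the sample numbers are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(features, weights, biases):
    """Map a T x W feature matrix to one confidence per label."""
    steps = len(features)
    width = len(features[0])
    # Average over the time axis to get a single W-dimensional vector.
    pooled = [sum(row[j] for row in features) / steps for j in range(width)]
    confidences = [
        sigmoid(sum(w * x for w, x in zip(w_row, pooled)) + b)
        for w_row, b in zip(weights, biases)
    ]
    return pooled, confidences

def cross_entropy(confidences, targets):
    """Mean binary cross-entropy; grows as predictions move away from targets."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for c, t in zip(confidences, targets):
        total -= t * math.log(c + eps) + (1.0 - t) * math.log(1.0 - c + eps)
    return total / len(confidences)

def sgd_step(features, targets, weights, biases, lr=0.5):
    """One gradient-descent update; returns the loss before the update."""
    pooled, confidences = forward(features, weights, biases)
    n = len(confidences)
    for k, (c, t) in enumerate(zip(confidences, targets)):
        grad = (c - t) / n  # derivative of the loss w.r.t. the k-th logit
        for j, x in enumerate(pooled):
            weights[k][j] -= lr * grad * x
        biases[k] -= lr * grad
    return cross_entropy(confidences, targets)

# Hypothetical data: T = 2, W = 3, and a label library with 2 labels.
features = [[0.2, -0.1, 0.4], [0.6, 0.3, -0.2]]
targets = [1.0, 0.0]  # first label applies, second does not
weights = [[0.0, 0.0, 0.0] for _ in targets]
biases = [0.0 for _ in targets]
losses = [sgd_step(features, targets, weights, biases) for _ in range(200)]
print(losses[0] > losses[-1])
```

The output vector has exactly one entry per label in the label library, which matches the requirement that the one-dimensional matrix be as large as the library's vocabulary.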
Any algorithm that can better optimize the parameters is in keeping with the spirit of the invention and falls within its scope.

In one embodiment, after the artificial intelligence language processing module 110 receives the text of a diagnosis certificate, it extracts the various pieces of information in the text, performs semantic representation, and converts the text into the TxW feature matrix. The artificial intelligence text classification module 120 then calculates the confidence level corresponding to each label, for example how much confidence there is in "leg trauma". The artificial intelligence label output module 140 then converts the numerical confidence-level matrix into the corresponding labels. For example, if the confidence threshold is set to 0.5, a text whose confidence level for the "leg trauma" label in the label library 130 is 0.5 or above is judged to correspond to the "leg trauma" label.

From the above description, those skilled in the art will readily understand that, in other embodiments, if the diagnosis certificate text received by the artificial intelligence language processing module 110 is "left foot contusion, fracture of the distal phalanx of the fourth toe of the left foot", the artificial intelligence label output module 140 converts it into the labels: (1) lower limb contusion (2) crushing injury of the trunk. If the diagnosis certificate text received by the artificial intelligence language processing module 110 is "laceration of the left upper limb (sutured with 3 stitches in outpatient surgery)", the artificial intelligence label output module 140 converts it into the label: (1) open wound of the hand, except fingers.
Alternatively, if the diagnosis certificate text received by the artificial intelligence language processing module 110 is "calculus of the lower left ureter with obstructive hydronephrosis", the artificial intelligence label output module 140 converts it into the labels: (1) hydronephrosis (2) calculus of kidney and ureter. Note that the above label examples only illustrate the invention and do not limit it.

Similarly, in another embodiment, the method and system of the invention can move the checking of medical records from after the fact to an in-process reminder. While a doctor is writing a diagnosis certificate, the doctor can be alerted in real time to non-compliant content, preventing non-standard medical records at the source. Based on natural language understanding and medical knowledge, the system can also automatically identify whether the doctor's diagnosis conforms to medical standards, adding an artificial intelligence safeguard to diagnosis and treatment.

Please refer to FIG. 2, which is a flowchart of an automatic text labeling method of the present invention. It includes, but is not limited to, the following steps (note that the steps need not be executed in the order shown in FIG. 2 if substantially the same result can be obtained):

Step S200: Start.

Step S210: Receive an input text and convert the input text into quantized features of at least one specific dimension.

Step S220: Convert the quantized features of the at least one specific dimension into a feature matrix of at least one specific confidence level.

Step S230: Store at least one label to be output.

Step S240: Convert the feature matrix of the at least one specific confidence level into at least one text label and a corresponding quantized confidence level.

How each component operates can be understood from the steps shown in FIG. 2 together with the components shown in FIG. 1.
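Steps S210 to S240 can be sketched as a short pipeline. The sketch assumes that the quantization and classification stages are supplied as functions and shows only how the confidence levels are compared with a threshold (0.5, as in the "leg trauma" example) to select labels from the label library. The names `select_labels` and `label_text`, the stand-in stages, and the confidence values 0.91 and 0.07 are hypothetical placeholders, not results from the specification.

```python
def select_labels(confidences, label_library, threshold=0.5):
    """Step S240: keep each label whose confidence is at or above the threshold."""
    if len(confidences) != len(label_library):
        raise ValueError("one confidence level per label is required")
    return [(label, conf)
            for label, conf in zip(label_library, confidences)
            if conf >= threshold]

def label_text(text, quantize, classify, label_library, threshold=0.5):
    """Run steps S210 to S240 on a single input text."""
    features = quantize(text)          # S210: text -> quantized features
    confidences = classify(features)   # S220: features -> confidence levels
    # S230: label_library holds the stored labels to be output
    return select_labels(confidences, label_library, threshold)

# Stand-in stages for demonstration only.
label_library = ["lower limb contusion", "hydronephrosis"]

def quantize(text):
    return [[float(len(word))] for word in text.split()]  # toy T x 1 matrix

def classify(features):
    return [0.91, 0.07]  # made-up confidence levels, one per label

print(label_text("left foot contusion", quantize, classify, label_library))
# prints [('lower limb contusion', 0.91)]
```

Returning each label together with its confidence level mirrors step S240, which outputs both the text label and the corresponding quantized confidence level.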
For brevity, the details are not repeated here.

This embodiment proposes an automatic text labeling method and system. The method may be implemented by a computer program, which may be an application that automatically determines disease names from diagnosis certificates based on a diagnosis certificate management system. The computer system may be a terminal device that runs the computer program, such as a smartphone, tablet computer, or personal computer. In addition, the modules of the automatic text labeling system 100, namely the artificial intelligence language processing module 110, the artificial intelligence text classification module 120, the label library 130, and the artificial intelligence label output module 140, may be implemented as computer programs or in hardware (for example, fabricated directly as a chip).

The advantage of the invention is that the automatic text labeling method and system can not only replace manual work and use artificial intelligence to speed up manual text classification, but also allow medical authorities and insurance companies to perform big data analysis of diagnosis and treatment conditions and disease trends through natural language analysis of medical records, thereby improving medical management and insurance services.

The above are merely various embodiments of the invention and do not thereby limit its patent scope. All simple modifications and equivalent structural changes made using the content of the specification and drawings of the invention shall fall within the patent scope covered by the invention.

100‧‧‧Automatic text labeling system

110‧‧‧Artificial intelligence language processing module

120‧‧‧Artificial intelligence text classification module

130‧‧‧Label library

140‧‧‧Artificial intelligence label output module

S200~S240‧‧‧Flowchart symbols

Various embodiments are described below with reference to the accompanying drawings, which are provided for illustration and are not intended to limit the scope in any way, wherein like reference numerals denote like components, and wherein: FIG. 1 is a block diagram of an automatic text labeling system of the present invention. FIG. 2 is a flowchart of an automatic text labeling method of the present invention.

Claims

An automatic text labeling system, comprising: an artificial intelligence language processing module, configured to receive an input text and convert the input text into quantized features of at least one specific dimension; an artificial intelligence text classification module, coupled to the artificial intelligence language processing module and configured to convert the quantized features of the at least one specific dimension into a feature matrix of at least one specific confidence level; a label library for storing at least one label to be output; and an artificial intelligence label output module, coupled to the artificial intelligence text classification module and the label library and configured to convert the feature matrix of the at least one specific confidence level into at least one text label and a corresponding quantized confidence level.

The automatic text labeling system of claim 1, wherein the artificial intelligence language processing module is further configured to quantify unstructured text.

The automatic text labeling system of claim 1, wherein the artificial intelligence text classification module is further configured to receive the digitized text feature matrix and output a feature matrix of the confidence levels corresponding to the given labels.

The automatic text labeling system of claim 1, wherein the artificial intelligence text classification module is further configured to convert the quantized features into a one-dimensional matrix, wherein the size of the array equals the vocabulary size of the label library.

The automatic text labeling system of claim 1, wherein the artificial intelligence label output module is further configured to convert the numerical confidence-level feature matrix into corresponding labels.

An automatic text labeling method, comprising: receiving an input text and converting the input text into quantized features of at least one specific dimension; converting the quantized features of the at least one specific dimension into a feature matrix of at least one specific confidence level; storing at least one label to be output; and converting the feature matrix of the at least one specific confidence level into at least one text label and a corresponding quantized confidence level.

The automatic text labeling method of claim 6, further comprising quantifying unstructured text.

The automatic text labeling method of claim 6, further comprising receiving the digitized text feature matrix and outputting a feature matrix of the confidence levels corresponding to the given labels.

The automatic text labeling method of claim 6, further comprising converting the quantized features into a one-dimensional matrix, wherein the size of the array equals the vocabulary size of the label library.

The automatic text labeling method of claim 6, further comprising converting the numerical confidence-level feature matrix into corresponding labels.