TW200809540A

TW200809540A - A method for string linking

Info

Publication number: TW200809540A
Application number: TW95128925A
Authority: TW
Inventors: Hsiao-Chun Tang; Ming-Yang Chiang; Jen-Diann Chiou
Original assignee: Nat Chengchi University
Priority date: 2006-08-07
Filing date: 2006-08-07
Publication date: 2008-02-16
Also published as: TWI314691B

Abstract

A string linking method comprises: inputting the discriminant strings to be classified; counting the occurrences of each discriminant string; restoring the abbreviations in each string using an abbreviation table; calculating the similarity between each pair of strings; determining two strings to be the same string when their similarity exceeds a threshold; and repeating the similarity comparison until the similarity between every pair of strings is below the threshold.

Description

IX. Description of the invention

[Technical field of the invention]

The present invention relates to a string linking method, and more particularly to a fault-tolerant string linking method.

[Prior art]

In patent search and evaluation, once a certain number of patent documents have been retrieved, they often need to be classified, for example by patentee or by patent title. Because names can be abbreviated, patent documents that belong to the same class are often split into different classes.

Take Taiwan Semiconductor Manufacturing Company as an example. Its registered English patentee names can include TSMC, Taiwan Semiconductor company, and Taiwan Semiconductor limited company, and within these names "company" can also be replaced by the abbreviation "co.". With so many registered names, a conventional automatic classification method treats patents registered under different patentee names as belonging to different companies, even though they should all be classified under the same company name.

With a conventional classification method, the user therefore often has to step in after classification to regroup patents that represent the same patentee but were assigned to different classes, which adds to the user's workload. Improving classification accuracy without adding to the user's workload has thus become a goal.
[Summary of the invention]

The main object of the present invention is therefore to provide a string linking method that links the various names representing the same target name.

The string linking method according to the present invention includes: inputting the discriminant strings to be classified; counting the occurrences of each discriminant string; using an abbreviation table to restore specific abbreviations in the discriminant strings; calculating the similarity between each pair of strings; determining two strings whose similarity value is higher than a threshold to be the same string; and continuing the similarity comparison until the similarity between every pair of strings is below the threshold.

According to one embodiment, the string with fewer occurrences is eliminated and the string with more occurrences is retained.

According to one embodiment, calculating the similarity between each pair of strings further includes using a dynamic programming algorithm to find the longest common string between the two strings, and normalizing the length of this common string to obtain a similarity value.

In summary, the method of the present invention links the various discriminant names that represent the same thing and classifies them under one name. When used for classification, items originally assigned to different discriminant names can therefore be merged under one name, improving the accuracy of the classification.

[Embodiment]

The application of the invention is described below using an embodiment that searches for and classifies patents whose patentee is Taiwan Semiconductor Manufacturing Company. Note, however, that the invention is not limited to this embodiment. Figure 1 is a schematic flow chart of linking different names and classifying them under one name according to the invention.
According to this embodiment, the English patentee names under which Taiwan Semiconductor Manufacturing Company may be registered include TSMC, Taiwan Semiconductor company, Taiwan Semiconductor limited company, Taiwan Semiconductor co., Taiwan Semiconductor LTD. company, and Taiwan Semiconductor LTD. Co., all of which represent the same company. Under a conventional classification method these names are judged to be different patentees, which distorts the patent classification. The present invention addresses this problem.

First, in step 101, the user inputs the patentee names to be searched and classified, that is, the discriminant strings to be classified. For example, a patent analyst may want to know which patents in a retrieved set are owned by certain patentees, and can enter the possible names of those patentees at this point.

Next, in step 102, the method counts the occurrences of each discriminant string that was entered. Because a patentee name appears only once in a patent, counting the occurrences of each patentee name amounts to counting the patents that correspond to that name, and each patent is classified under its corresponding patentee name. Using the patentee names above as an example, this step separately counts the occurrences of each variant, such as TSMC and Taiwan Semiconductor company, and classifies the patents under the corresponding names.

Next, in step 103, an abbreviation table 104 is used to restore specific abbreviations in the discriminant strings.
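Steps 102 and 103 can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names, the token-by-token replacement, and the pooling of counts after restoration are assumptions made for illustration, and the table entries are the three abbreviations named in the embodiment.

```python
from collections import Counter

# Hypothetical contents of abbreviation table 104; the three entries are
# the examples given in the embodiment.
ABBREVIATIONS = {
    "LTD.": "limited",
    "Co.": "company",
    "TSMC": "Taiwan Semiconductor company",
}


def restore_abbreviations(name, table=ABBREVIATIONS):
    """Step 103 (sketch): replace every whitespace-separated token that
    appears in the table with its restored form."""
    return " ".join(table.get(token, token) for token in name.split())


def count_and_restore(patentee_names):
    """Steps 102-103 (sketch): count each name as entered, then pool the
    counts of names that become identical after restoration."""
    raw_counts = Counter(patentee_names)  # step 102: one name per patent
    restored_counts = Counter()
    for name, n in raw_counts.items():
        restored_counts[restore_abbreviations(name)] += n
    return raw_counts, restored_counts


print(restore_abbreviations("Taiwan Semiconductor LTD. Co."))
# Taiwan Semiconductor limited company
```

Matching whole tokens keeps the sketch short; it is case-sensitive, so a table used in practice would need an entry for each spelling variant it should recognize.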
In the embodiment above, for example, LTD. is the abbreviation of limited, Co. is the abbreviation of company, and TSMC is the abbreviation of Taiwan Semiconductor company. These abbreviations and their corresponding restored strings can be defined by the user in the abbreviation table 104, and in step 103 the abbreviations in the discriminant strings are restored according to these definitions. For example, after the restoration in step 103, Taiwan Semiconductor LTD. Co. becomes Taiwan Semiconductor limited company. Taiwan Semiconductor LTD. Company and Taiwan Semiconductor LTD. Co. are thus regarded as the same patentee after restoration.

Human error when entering strings is unavoidable. Likewise, during a patent application a clerical error may cause a letter of the patentee name to be lost, for example "semiconductor" entered as "semiconducter", so that two patents belonging to the same patentee are judged to belong to different patentees. In step 105, the invention therefore compensates for such errors by calculating the similarity between each pair of strings and treating two strings whose similarity exceeds a specific value as the same string.

In one embodiment, the similarity between two strings is determined in step 107 by using a conventional dynamic programming algorithm to find the longest common string between them. For example, the strings "Taiwan Semiconductor company" and "Taiwan Semiconducter company" both have length 26 (counting letters only, without spaces), and the longest common string between them is "Taiwan semiconductr company", with length 25.
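The patent does not name the dynamic programming algorithm. The worked example, in which the common string skips the one differing letter, is consistent with the classic longest common subsequence (LCS) recurrence, so the sketch below assumes that reading. The function name is hypothetical.

```python
def longest_common_subsequence(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (classic DP)."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover the characters.
    chars = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            chars.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(chars))


common = longest_common_subsequence("Taiwan Semiconductor company",
                                    "Taiwan Semiconducter company")
print(common)                        # Taiwan Semiconductr company
print(len(common.replace(" ", "")))  # 25
```

With the two spaces removed, the recovered string has 25 characters, matching the length given in the example.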
Next, in step 108, the length of the common string is normalized. For example, the common string length, 25, is divided by the average length of the two strings, 26, giving a ratio of 96%, which is the similarity value. This similarity calculation is carried out for every pair of discriminant strings restored in step 103.

Next, in step 109, two strings whose similarity value is higher than the threshold are determined to be the same string. Suppose that in one embodiment the threshold is set to a given value [value illegible in the source]; the similarity value of 96% in the example above is greater than that threshold. The strings "Taiwan Semiconductor company" and "Taiwan Semiconducter company" are therefore regarded as the same string, and the two groups of patents corresponding to them are merged into one group.

According to this embodiment, because the wrong patentee name results from a clerical error, relatively few patents are classified under it. Therefore, when the patents corresponding to two strings are merged in step 109, the string with fewer corresponding patents is eliminated and the string with more corresponding patents is retained. In other words, the smaller group of patents is merged into the larger group.

The similarity comparison and the merging of the corresponding patents are then repeated until the similarity between every pair of strings is below the threshold. For example, in the first round of comparison, two strings such as the first and second strings may have a similarity value below the threshold. They can still be linked through another string, such as a third string: because the similarity between the first and third strings and between the second and third strings is above the threshold, the first and second strings come to be regarded as the same string, and their corresponding patents are merged and treated as belonging to the same patentee.
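Normalization and repeated merging (steps 108 and 109, plus the repeated comparison) can be sketched as follows. Several choices here are assumptions not stated in the patent: spaces are ignored so that the lengths match the worked example (26 and 25), the most similar pair is merged first, and the threshold of 0.9 used in the checks is purely illustrative because the source value is illegible. The function names are hypothetical.

```python
from itertools import combinations
from typing import Dict


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, keeping one DP row."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch == other
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Step 108 (sketch): common length over the average string length."""
    a, b = a.replace(" ", ""), b.replace(" ", "")  # assumption: ignore spaces
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / ((len(a) + len(b)) / 2)


def link_strings(counts: Dict[str, int], threshold: float) -> Dict[str, int]:
    """Merge names until no pair has a similarity above the threshold.

    counts maps each restored name to its number of patents. On each pass
    the most similar pair above the threshold is merged (an assumed order),
    and the name with fewer patents is folded into the one with more.
    """
    groups = dict(counts)
    while True:
        scored = [(similarity(a, b), a, b) for a, b in combinations(groups, 2)]
        above = [item for item in scored if item[0] > threshold]
        if not above:
            return groups
        _, a, b = max(above)
        keep, drop = (a, b) if groups[a] >= groups[b] else (b, a)
        groups[keep] += groups.pop(drop)


linked = link_strings({"Taiwan Semiconductor company": 40,
                       "Taiwan Semiconducter company": 2,
                       "Other Firm": 5}, threshold=0.9)
print(linked)
```

Because every pass compares the surviving names again, a name can end up in the same group as one it does not directly resemble, provided both exceed the threshold against the name that survives.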
In summary, the method of the present invention links the various discriminant names that represent the same thing and classifies them under one name. When used for patent classification, patents originally assigned to different discriminant names can therefore be merged under one name, improving the accuracy of the patent classification.

Although the present invention has been disclosed above by way of a preferred embodiment, that embodiment is not intended to limit the invention. Those skilled in the art may make various changes and modifications without departing from the spirit and scope of the invention, and the scope of protection is therefore defined by the appended claims.

[Brief description of the drawings]

To make the above and other objects, features, advantages, and embodiments of the invention easier to understand, the accompanying drawing is described as follows:

Figure 1 is a schematic flow chart of linking different names and classifying them under one name according to the invention.

[Description of main element symbols]

101 to 110: steps

Claims

X. Scope of the patent application:

1. A string linking method for determining different strings to be the same string, comprising:
using an abbreviation table to restore specific abbreviations in the strings;
calculating a similarity value between each pair of strings;
determining two strings whose similarity value is higher than a threshold to be the same string; and
continuing the similarity comparison until the similarity value between every pair of strings is below the threshold.

2. The string linking method of claim 1, wherein calculating the similarity value between each pair of strings further comprises:
using a dynamic programming algorithm to find the longest common string between the two strings; and
normalizing the length of the common string to obtain the similarity value.

3. The string linking method of claim 2, wherein normalizing the length of the common string further comprises:
calculating the average string length of the two strings; and
dividing the length of the common string by the average string length.

4. The string linking method of claim [number illegible in the source], further comprising calculating the number of occurrences of each discriminant string.

5. The string linking method of claim 1, further comprising defining the abbreviation table.
6. A string linking method for classifying a plurality of technical documents, wherein each technical document has a discriminant string, comprising:
inputting the discriminant strings to be classified;
calculating, according to the input discriminant strings, the number of technical documents corresponding to each discriminant string;
using an abbreviation table to restore specific abbreviations in the discriminant strings;
calculating a similarity value between each pair of discriminant strings;
determining two discriminant strings whose similarity value is higher than a threshold to be the same string, and merging the technical documents corresponding to the two discriminant strings; and
continuing the similarity comparison until the similarity value between every pair of discriminant strings is below the threshold.

7. The string linking method of claim 6, wherein calculating the similarity value between each pair of discriminant strings comprises:
using a dynamic programming algorithm to find the longest common string between the two discriminant strings; and
normalizing the length of the common string to obtain the similarity value.

8. The string linking method of claim 7, wherein normalizing the length of the common string further comprises:
calculating the average string length of the two discriminant strings; and
dividing the length of the common string by the average string length.

9. The string linking method of claim 6, wherein, when the technical documents corresponding to the two discriminant strings are merged, the discriminant string with fewer corresponding technical documents is eliminated and the discriminant string with more corresponding technical documents is retained.

10. The string linking method of claim 6, further comprising defining the abbreviation table.