TWI314691B

TWI314691B - A method for string linking

Info

Publication number: TWI314691B
Application number: TW95128925A
Authority: TW
Inventors: Hsiaochun Tang; Ming Yang Chiang; Jen Diann Chiou
Original assignee: Nat Chengchi University
Priority date: 2006-08-07
Filing date: 2006-08-07
Publication date: 2009-09-11
Also published as: TW200809540A

Description

IX. Description of the Invention

[Technical Field]

The present invention relates to a string linking method, and more particularly to a string linking method with a fault-tolerant function.

[Prior Art]

In patent search and evaluation, once a certain number of patent documents have been retrieved, the documents often need to be classified, for example by patentee or by patent category. However, because names can be abbreviated, patent documents that belong to the same patentee are often split into two or more classes. Take Taiwan Integrated Circuit Company as an example: the English patentee names under which its patents are registered may include TSMC, Taiwan Semiconductor company, and Taiwan Semiconductor limited company, all of which denote the same company, and "company" may in turn be replaced by the abbreviation "CO.". With so many registered names, a conventional automatic classification method treats patents registered under different patentee names as belonging to different patentees, even though they should be classified under the same company. After such a classification is complete, the user must therefore still regroup the patents that belong to the same patentee but were assigned to different classes, which adds to the user's burden.
How to improve classification accuracy without increasing the user's operating burden has therefore become a goal to pursue.

[Summary of the Invention]

The main object of the present invention is therefore to provide a string linking method that links the various names representing the same target. The string linking method of the present invention comprises: inputting the discriminant strings to be classified; counting the occurrences of each discriminant string based on the input; using an abbreviation lookup table to expand specific abbreviations in the discriminant strings; computing the similarity between every pair of strings; determining two strings whose similarity value is above a threshold to be the same string; and continuing the similarity comparison until the similarity between every pair of strings is below the threshold.

According to an embodiment, the string with fewer occurrences is eliminated and the string with more occurrences is retained.

According to an embodiment, computing the similarity between two strings further comprises using a dynamic programming algorithm to find the longest common string between the two strings, and normalizing the length of this common string to obtain a similarity value.

In summary, the method of the present invention links the various discriminant names that represent the same thing and groups them under one name. When used for classification, it therefore brings items originally assigned to different discriminant names under a single name, improving classification accuracy.

[Embodiment]

The application of the present invention is described below using an embodiment that searches for and classifies patents whose patentee is Taiwan Integrated Circuit Company.
It should be noted, however, that the invention is not limited to this embodiment.

Figure 1 is a schematic flow chart of the present invention linking different names and classifying them under the same name. In this embodiment, the English patentee names under which Taiwan Integrated Circuit Company may be registered include TSMC, Taiwan Semiconductor company, Taiwan Semiconductor limited company, Taiwan Semiconductor co., Taiwan Semiconductor LTD. company, and Taiwan Semiconductor LTD. Co, all of which denote Taiwan Integrated Circuit Company. Under a conventional classification method these names are treated as different patentees, which distorts the patent classification; the present invention addresses this problem.

First, in step 101, the user inputs the patentee names to be searched and classified, that is, the discriminant strings to be classified. For example, among a certain number of retrieved patents, a patent analyst may want to know which patents certain patentees own, and can enter the possible names of those patentees.

Next, in step 102, the method counts the occurrences of each discriminant string based on the input. Because a patentee name appears only once in a patent, counting the occurrences of each patentee name amounts to counting the patents corresponding to that name, and each patent is classified under its patentee name. In the Taiwan Integrated Circuit Company example, this step counts the occurrences of each of the company's patentee names, such as TSMC and Taiwan Semiconductor company, and classifies the patents under the corresponding names.
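Steps 101 and 102 can be sketched in Python as follows. This is only an illustration under assumed data structures: the `(patent_id, patentee)` record format, the function name `group_by_patentee`, and the sample identifiers are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def group_by_patentee(records):
    """Step 102 sketch: map each patentee name to the patents registered under it."""
    groups = defaultdict(list)
    for patent_id, patentee in records:
        groups[patentee].append(patent_id)
    return dict(groups)


# Hypothetical records: (patent identifier, patentee name as registered).
records = [
    ("P1", "TSMC"),
    ("P2", "Taiwan Semiconductor company"),
    ("P3", "Taiwan Semiconductor company"),
    ("P4", "Taiwan Semiconductor LTD. Co."),
]

groups = group_by_patentee(records)
# Each name appears once per patent, so the occurrence count of a name
# equals the number of patents grouped under it.
counts = {name: len(ids) for name, ids in groups.items()}
print(counts)
```

Keeping the patent list per name, rather than a bare count, makes the later merging of patent groups a simple list concatenation.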
Then, in step 103, an abbreviation lookup table 104 is used to expand specific abbreviations in the discriminant strings. In this embodiment, for example, LTD. is the abbreviation of limited, Co. is the abbreviation of company, and TSMC is the abbreviation of Taiwan Semiconductor company. The user defines these abbreviations and their expanded strings in the abbreviation lookup table 104, and step 103 expands the abbreviations in the discriminant strings according to those definitions. For example, after step 103, Taiwan Semiconductor LTD. Co. becomes Taiwan Semiconductor limited company. Taiwan Semiconductor LTD. Company and Taiwan Semiconductor LTD. Co are then regarded as the same patentee.

Input strings inevitably contain errors caused by human oversight. Likewise, during the patent application process a clerical error may drop or alter a letter in the patentee name, for example "semiconductor" mistyped as "semiconducter", so that two patents belonging to the same patentee are judged to belong to different patentees. In step 105, the invention therefore computes the similarity between every pair of strings and treats two strings whose similarity exceeds a given value as the same string, which compensates for such human errors.
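The abbreviation expansion of step 103 might look like the sketch below. The table contents follow the examples in the text, but the token-by-token, case-insensitive lookup and the names `ABBREVIATIONS` and `expand_abbreviations` are assumptions made for illustration; the patent does not specify how the table is matched against a string.

```python
# Hypothetical user-defined abbreviation lookup table (item 104 in the text).
ABBREVIATIONS = {
    "ltd.": "limited",
    "ltd": "limited",
    "co.": "company",
    "co": "company",
    "tsmc": "Taiwan Semiconductor company",
}


def expand_abbreviations(name, table=ABBREVIATIONS):
    """Step 103 sketch: expand every whitespace-separated token found in the table."""
    # Assumption: lookup is per token and ignores letter case.
    tokens = name.split()
    return " ".join(table.get(token.lower(), token) for token in tokens)


print(expand_abbreviations("Taiwan Semiconductor LTD. Co."))
print(expand_abbreviations("TSMC"))
```

Tokens that are not in the table pass through unchanged, so a misspelled word such as "Semiconducter" survives this step and is left for the similarity comparison.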
In one embodiment, the similarity between two strings is determined as follows. In step 107, a conventional dynamic programming algorithm finds the longest common string between the two strings. For example, the strings "Taiwan Semiconductor company" and "Taiwan Semiconducter company" each have length 26 (not counting spaces), and their longest common string is "Taiwan semiconductr company", of length 25. In step 108, the length of the common string is normalized, for example by dividing it, 25, by the average length of the two strings, 26, which gives a ratio of 96%; the similarity value is 96%. This similarity computation is performed between every pair of discriminant strings expanded in step 103.

Next, in step 109, two strings whose similarity value is above the threshold are determined to be the same string. Suppose the threshold in an embodiment is set to 80%. In the example above the similarity value of 96% exceeds the threshold of 80%, so "Taiwan Semiconductor company" and "Taiwan Semiconducter company" are regarded as the same string, and the two groups of patents corresponding to these strings are merged into one group. In this embodiment, a patentee name that results from a clerical error has comparatively few patents classified under it. When the patents corresponding to the two strings are merged, the string with fewer patents is therefore eliminated and the string with more patents is retained; in other words, the smaller patent group is merged into the larger one.
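The similarity computation of steps 107 and 108 can be sketched as below. Because the example's common string "Taiwan semiconductr company" skips the mismatched letter, the sketch reads "longest common string" as a longest common subsequence; that reading, the removal of spaces, and the case folding are assumptions chosen to reproduce the lengths 26 and 25 given in the text. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b, by dynamic programming."""
    prev = [0] * (len(b) + 1)  # row for the previous character of a
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Step 108 sketch: common length divided by the average of the two lengths."""
    # Assumption: spaces are ignored and letter case is folded before comparing.
    x = a.replace(" ", "").lower()
    y = b.replace(" ", "").lower()
    if not x and not y:
        return 1.0
    return lcs_length(x, y) / ((len(x) + len(y)) / 2)


score = similarity("Taiwan Semiconductor company", "Taiwan Semiconducter company")
print(round(score, 2))  # 25 / 26, which rounds to 0.96
print(score > 0.8)      # above the 80% threshold of the example
```

The table is filled one row at a time, so memory stays proportional to the length of the second string while the running time is proportional to the product of the two lengths.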
The similarity comparison and the merging of the corresponding patents continue until the similarity between every pair of strings is below the set threshold, as described in step 110. It is worth noting that, in the similarity comparison of the present invention, even if the similarity value between a first string and a second string is below the threshold, the two may still be merged through another string, such as a third string: when the similarity values between the first and third strings and between the second and third strings are both above the threshold, the first and second strings are merged and regarded as the same patentee, and their corresponding patents are merged as well.

In summary, the method of the present invention links the various discriminant names that represent the same thing and groups them under one name. When used for patent classification, it therefore merges patents originally assigned to different discriminant names under a single name, improving classification accuracy.

Although the present invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. Anyone skilled in the art may make changes and refinements without departing from the spirit and scope of the invention, and the scope of protection is therefore defined by the appended claims.

[Brief Description of the Drawings]

To make the above and other objects, features, advantages, and embodiments of the present invention easier to understand, the accompanying drawing is described as follows:

Figure 1 is a schematic flow chart of the present invention linking different names and classifying them under the same name.

[Description of Main Reference Numerals]

101~110: steps
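The repeated merging of steps 109 and 110 in the embodiment, including linking two names through a third, can be sketched as follows. This is one possible interpretation, not the patented procedure: each group keeps all of its member names, two groups merge when any pair of members exceeds the threshold, and the member with the most patents is kept as the group's name. The function `link_names`, the toy scores, and the sample counts are all hypothetical.

```python
def link_names(counts, similarity, threshold=0.8):
    """Merge names until no two groups contain a pair scoring above the threshold."""
    groups = [[name] for name in counts]  # start with one group per name
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if any(similarity(a, b) > threshold
                       for a in groups[i] for b in groups[j]):
                    groups[i].extend(groups.pop(j))
                    merged = True
                    break
            if merged:
                break
    linked = {}
    for members in groups:
        # The name with the most patents survives; the others are absorbed.
        keep = max(members, key=lambda name: counts[name])
        linked[keep] = sum(counts[name] for name in members)
    return linked


# Toy scores (assumed): A and B are dissimilar, but both resemble C.
toy_scores = {
    frozenset(("A", "C")): 0.9,
    frozenset(("B", "C")): 0.9,
    frozenset(("A", "B")): 0.7,
}


def toy_similarity(a, b):
    return toy_scores.get(frozenset((a, b)), 0.0)


print(link_names({"A": 5, "B": 3, "C": 1}, toy_similarity))  # {'A': 9}
```

Retaining the member names of each group is what allows A and B to end up together even though the absorbed name C no longer appears as a group name.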

Claims

X. Scope of the Patent Application:

1. A string linking method for determining that a plurality of different strings belong to the same string, comprising:
using an abbreviation lookup table to expand specific abbreviations in the strings;
computing a similarity value between every pair of strings;
determining two strings whose similarity value is above a threshold to be the same string; and
continuing the similarity value comparison until the similarity value between any two strings is below the threshold.

2. The string linking method as described in claim 1, wherein computing the similarity value between every pair of strings further comprises:
using a dynamic programming algorithm to find the longest common string between the two strings; and
normalizing the length of the common string to obtain the similarity value.

3. The string linking method as described in claim 2, wherein normalizing the length of the common string further comprises:
calculating an average string length of the two strings; and
dividing the length of the common string by the average string length.

4. The string linking method as described in claim 1, further comprising counting the occurrences of each discriminant string.

5. The string linking method as described in claim 1, further comprising defining the abbreviation lookup table.

6. A string linking method for classifying a plurality of technical documents, wherein each of the technical documents has a discriminant string, comprising:
inputting the discriminant strings to be classified;
counting, according to the input discriminant strings, the number of technical documents corresponding to each discriminant string;
using an abbreviation lookup table to expand specific abbreviations in the discriminant strings;
computing a similarity value between every pair of discriminant strings;
determining two discriminant strings whose similarity value is above a threshold to be the same string, and merging the technical documents corresponding to the two discriminant strings; and
continuing the similarity value comparison until the similarity value between every pair of discriminant strings is below the threshold.

7. The string linking method as described in the scope of the patent application, wherein computing the similarity value between every pair of discriminant strings further comprises:
using a dynamic programming algorithm to find the longest common string between the two discriminant strings; and
normalizing the length of the common string to obtain the similarity value.

8. The string linking method as described in claim 7, wherein normalizing the length of the common string further comprises:
calculating an average string length of the two discriminant strings; and
dividing the length of the common string by the average string length.

9. The string linking method as described in claim 6, wherein merging the technical documents corresponding to the two discriminant strings further comprises eliminating the discriminant string having fewer technical documents and retaining the discriminant string having more technical documents.

10. The string linking method as described in claim 6, further comprising defining the abbreviation lookup table.
