TWI314691B - A method for string linking - Google Patents

A method for string linking

Info

Publication number
TWI314691B
Authority
TW
Taiwan
Prior art keywords
string
discriminant
strings
length
similarity value
Prior art date
Application number
TW95128925A
Other languages
Chinese (zh)
Other versions
TW200809540A (en)
Inventor
Hsiaochun Tang
Ming Yang Chiang
Jen Diann Chiou
Original Assignee
Nat Chengchi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nat Chengchi University
Priority to TW95128925A
Publication of TW200809540A
Application granted
Publication of TWI314691B

Links

Description

IX. Description of the Invention:
[Technical Field of the Invention]
The present invention relates to a string linking method, and more particularly to a string linking method with a fault-tolerant function.
[Prior Art]
During patent search and evaluation, once a certain number of patent documents have been retrieved, the documents often need to be classified, for example by patentee or by category. Because names can be abbreviated, however, patent documents that belong to the same patentee are often split into separate classes.
Take TSMC (台灣積體電路公司) as an example. The English patentee names under which it is registered may include TSMC, Taiwan Semiconductor company, and Taiwan Semiconductor limited company, all of which denote the same company, and the word "company" may itself be replaced by the abbreviation "CO.". With so many registered names, a conventional automatic classification method treats patents registered under different patentee names as belonging to different companies, even though they should all be classified under one company name. After such a classification finishes, the user therefore still has to regroup by hand the patents that belong to the same patentee but were assigned to different classes, which adds to the user's burden.
The goal is therefore to improve classification accuracy without increasing the user's workload.
[Summary of the Invention]
The main object of the present invention is accordingly to provide a string linking method that links the various names denoting the same target name.
The string linking method according to the present invention comprises: inputting the discriminant strings to be classified; counting the occurrences of each discriminant string among the input strings; restoring specific abbreviations in the discriminant strings by means of an abbreviation table; computing the similarity between every pair of strings; judging two strings to be the same string when their similarity value is above a threshold; and continuing the similarity comparison until the similarity between every pair of strings is below the threshold.
According to one embodiment, the string with fewer occurrences is eliminated and the string with more occurrences is retained.
According to one embodiment, computing the similarity between every pair of strings further comprises using a dynamic programming algorithm to find the longest common string between the two strings, and normalizing the length of this common string to obtain a similarity value.
In summary, the method of the present invention links the various discriminant names that denote the same thing and classifies them under one name. When used for classification, it brings items originally filed under different discriminant names together under one name, which improves classification accuracy.
[Embodiment]
The application of the present invention is described below through an embodiment that searches for and classifies patents whose patentee is TSMC (台灣積體電路公司). Note, however, that the invention is not limited to this embodiment.
Fig. 1 is a schematic flowchart of the present invention linking different names and classifying them under the same name. In this embodiment, the English patentee names under which TSMC may be registered include TSMC, Taiwan Semiconductor company, Taiwan Semiconductor limited company, Taiwan Semiconductor co., Taiwan Semiconductor LTD. company, and Taiwan Semiconductor LTD. Co, all of which denote the same company. A conventional classification method judges these names to be different patentees, which distorts the patent classification; the present invention addresses this problem.
First, in step 101, the user inputs the patentee names to be searched and classified, that is, the discriminant strings to be classified. For example, among a certain number of retrieved patents, a patent analyst may want to know which patents are held by particular patentees, and can enter the possible names of those patentees at this point.
Next, in step 102, the method counts the occurrences of each discriminant string among the input strings. Because a patentee name appears only once in each patent, counting the occurrences of a patentee name is the same as counting the patents that correspond to that name. Each patent is then classified under its corresponding patentee name.
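The counting in step 102 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the record layout, the document ids, and the function names `group_by_name` and `count_names` are all assumptions made for the example.

```python
from collections import defaultdict


def group_by_name(records):
    """Group document ids under the discriminant string each one carries.

    records: iterable of (document_id, discriminant_string) pairs.
    """
    groups = defaultdict(list)
    for doc_id, name in records:
        groups[name].append(doc_id)
    return dict(groups)


def count_names(groups):
    """Number of documents filed under each discriminant string."""
    return {name: len(doc_ids) for name, doc_ids in groups.items()}


# Hypothetical records, for illustration only.
records = [
    ("P1", "TSMC"),
    ("P2", "Taiwan Semiconductor company"),
    ("P3", "TSMC"),
]
groups = group_by_name(records)
counts = count_names(groups)
```

Keeping the grouped document ids alongside the counts means a later merge of two names can also merge their document lists.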
Taking the patentee names of TSMC (台灣積體電路公司) as an example, step 102 separately counts the occurrences of each of its patentee names, such as TSMC and Taiwan Semiconductor company, and classifies each patent under its corresponding patentee name.
Next, in step 103, an abbreviation table 104 is used to restore specific abbreviations in the discriminant strings. In the above embodiment, for example, "limited" is abbreviated as "LTD.", "company" is abbreviated as "Co.", and "Taiwan Semiconductor company" is abbreviated as "TSMC". These abbreviations and their corresponding restored strings can all be defined by the user in the abbreviation table 104, and step 103 restores the abbreviations in each discriminant string according to those definitions. For example, after the restoration in step 103, "Taiwan Semiconductor LTD. Co." becomes "Taiwan Semiconductor limited company". "Taiwan Semiconductor LTD. Company" and "Taiwan Semiconductor LTD. Co" are then regarded as the same patentee.
Typing errors are unavoidable when strings are entered by hand. Likewise, a clerical error during the patent application process may drop or change a letter in the patentee name, for example "semiconductor" entered as "semiconducter", so that two patents belonging to the same patentee are judged to belong to different patentees. Step 105 of the present invention therefore computes the similarity between every pair of strings and treats two strings whose similarity exceeds a specific value as the same string, which compensates for such human errors.
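The restoration in step 103 can be sketched as a word-by-word table lookup. This is only one possible reading: the table contents below are the examples given in the description, while the function name `expand_abbreviations`, the case-insensitive lookup, and the whitespace tokenization are assumptions made for the sketch.

```python
# Hypothetical abbreviation table (element 104); in the description the
# entries are defined by the user.
ABBREVIATIONS = {
    "ltd.": "limited",
    "ltd": "limited",
    "co.": "company",
    "co": "company",
    "tsmc": "Taiwan Semiconductor company",
}


def expand_abbreviations(name, table=ABBREVIATIONS):
    """Replace each whitespace-separated word that is a known abbreviation.

    The lookup ignores case (an assumption); words that are not in the
    table are kept unchanged.
    """
    return " ".join(table.get(word.lower(), word) for word in name.split())


restored = expand_abbreviations("Taiwan Semiconductor LTD. Co.")
```

Because only the lookup ignores case, "Taiwan Semiconductor LTD. Company" restores to "Taiwan Semiconductor limited Company"; comparing restored names case-insensitively is one way to treat it as equal to the restored form of "Taiwan Semiconductor LTD. Co".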
In one embodiment, the similarity between two strings is determined in step 107 by using a conventional dynamic programming algorithm to find the longest common string between them. For the two strings "Taiwan Semiconductor company" and "Taiwan Semiconducter company", for example, both strings have a length of 26 (counting letters only, without spaces), and the longest common string between them is "Taiwan semiconductr company", which has a length of 25. Then, in step 108, the length of this common string is normalized: for example, the common string length, 25, is divided by the average of the two string lengths, 26, giving a ratio of 96%, so the similarity value is 96%. This similarity computation is carried out between every pair of discriminant strings after they have been restored in step 103.
Next, in step 109, two strings whose similarity value is above a threshold are judged to be the same string. Suppose that in one embodiment the threshold is set to 80%. In the example above the similarity value is 96%, which is greater than the 80% threshold, so "Taiwan Semiconductor company" and "Taiwan Semiconducter company" are regarded as the same string, and the two groups of patents corresponding to these strings are merged into one group. In this embodiment, a patentee name that results from a clerical error will usually have fewer patents classified under it. When the patents of the two strings are merged in step 109, the string with fewer patents is therefore eliminated and the string with more patents is retained. In other words, the group with fewer patents is merged into the group with more patents.
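Steps 107 to 109 can be sketched as follows. The sketch assumes that the "longest common string" is a longest common subsequence, which matches the worked example (one letter is dropped from the middle of a word), and that case and spaces are ignored, which matches the stated lengths of 26 and 25. The function names are illustrative and do not come from the patent.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, 1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity(s1, s2):
    """Common length divided by the average length of the two strings.

    Assumption: letters are compared without case and without spaces.
    """
    a = "".join(s1.lower().split())
    b = "".join(s2.lower().split())
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / ((len(a) + len(b)) / 2)


def is_same(s1, s2, threshold=0.8):
    """Step 109: same string when the similarity is above the threshold."""
    return similarity(s1, s2) > threshold


value = similarity("Taiwan Semiconductor company",
                   "Taiwan Semiconducter company")  # 25 / 26
```

Dividing by the average length keeps the value between 0 and 1 and penalizes a short string that merely happens to be contained in a much longer one.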
The similarity comparison and the merging of the corresponding patents continue until the similarity between every pair of strings is below the set threshold, as described in step 110.
Note that in the similarity comparison of the present invention, a first string and a second string whose similarity value is below the threshold can still be linked through another string, such as a third string. When the similarity values between the first and third strings and between the second and third strings are both above the threshold, the first and second strings are treated as the same string, so that their corresponding patents are merged and regarded as belonging to the same patentee.
In summary, the method of the present invention links the various discriminant names that denote the same thing and classifies them under one name. When used for patent classification, it merges patents originally filed under different discriminant names under one name, which improves classification accuracy.
Although the present invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. Anyone skilled in the art may make various changes and modifications without departing from the spirit and scope of the invention, and the scope of protection is therefore defined by the appended claims.
[Brief Description of the Drawings]
To make the above and other objects, features, advantages, and embodiments of the present invention easier to understand, the accompanying drawing is described as follows:
Fig. 1 is a schematic flowchart of the present invention linking different names and classifying them under the same name.
[Description of Main Reference Numerals]
101-110: steps
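The linking behaviour of steps 109 and 110, including the case where two names are joined only through a third, can be sketched as follows. The description does not say how the repeated comparison is organized; this sketch swaps in a union-find structure as one way to obtain the transitive linking, and the function name `link_names` and its parameters are assumptions. The similarity function is passed in as an argument so the sketch stays independent of how similarity is computed.

```python
from itertools import combinations


def link_names(counts, similarity, threshold=0.8):
    """Link names whose similarity is above the threshold, transitively.

    counts: dict mapping each name to its number of documents.
    similarity: callable (name_a, name_b) -> float.
    Returns a dict mapping each surviving name (the one with the most
    documents in its group) to the merged document count.
    """
    parent = {name: name for name in counts}

    def find(name):
        # Follow parent links to the group root, shortening the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(counts, 2):
        if similarity(a, b) > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in counts:
        groups.setdefault(find(name), []).append(name)

    merged = {}
    for members in groups.values():
        survivor = max(members, key=lambda n: counts[n])
        merged[survivor] = sum(counts[n] for n in members)
    return merged


# Toy similarity table: A and B are below the threshold with each other,
# but both are above it with C, so all three end up in one group.
pairs = {
    frozenset(("A", "C")): 0.90,
    frozenset(("B", "C")): 0.85,
    frozenset(("A", "B")): 0.70,
}
toy_similarity = lambda x, y: pairs.get(frozenset((x, y)), 0.0)
result = link_names({"A": 5, "B": 2, "C": 1, "D": 3}, toy_similarity)
```

Grouping by connected components makes the result independent of the order in which pairs are compared, which a merge-and-eliminate loop does not guarantee when the surviving name is not the one that bridged the other two.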

Claims (1)

X. Claims:
1. A string linking method for determining whether a plurality of different strings should be attributed to the same string, comprising:
using an abbreviation table to restore specific abbreviations present in the strings;
computing a similarity value between every pair of strings;
judging two strings to be the same string when the similarity value between them is above a threshold; and
continuing the similarity value comparison until the similarity value between any two strings is below the threshold.
2. The string linking method of claim 1, wherein computing the similarity value between every pair of strings further comprises:
using a dynamic programming algorithm to find the longest common string between the two strings; and
normalizing the length of the common string to obtain the similarity value.
3. The string linking method of claim 2, wherein normalizing the length of the common string further comprises:
computing the average string length of the two strings; and
dividing the length of the common string by the average string length.
4. The string linking method of claim 1, further comprising counting the occurrences of each discriminant string.
5. The string linking method of claim 1, further comprising defining the abbreviation table.
6. A string linking method for classifying a plurality of technical documents, each technical document having a discriminant string, comprising:
inputting the discriminant strings to be classified;
counting, from the input discriminant strings, the number of technical documents corresponding to each discriminant string;
using an abbreviation table to restore specific abbreviations in the discriminant strings;
computing a similarity value between every pair of discriminant strings;
judging two discriminant strings to be the same string when the similarity value between them is above a threshold, and merging the technical documents corresponding to the two discriminant strings; and
continuing the similarity comparison until the similarity between every pair of discriminant strings is below the threshold.
7. The string linking method of claim 6, wherein computing the similarity value between every pair of discriminant strings further comprises:
using a dynamic programming algorithm to find the longest common string between the two discriminant strings; and
normalizing the length of the common string to obtain the similarity value.
8. The string linking method of claim 7, wherein normalizing the length of the common string further comprises:
computing the average string length of the two discriminant strings; and
dividing the length of the common string by the average string length.
9. The string linking method of claim 6, wherein merging the technical documents corresponding to the two discriminant strings further comprises eliminating the discriminant string with fewer technical documents and retaining the discriminant string with more technical documents.
10. The string linking method of claim 6, further comprising defining the abbreviation table.
TW95128925A 2006-08-07 2006-08-07 A method for string linking TWI314691B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW95128925A TWI314691B (en) 2006-08-07 2006-08-07 A method for string linking

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW95128925A TWI314691B (en) 2006-08-07 2006-08-07 A method for string linking

Publications (2)

Publication Number Publication Date
TW200809540A TW200809540A (en) 2008-02-16
TWI314691B true TWI314691B (en) 2009-09-11

Family

ID=44767177

Family Applications (1)

Application Number Title Priority Date Filing Date
TW95128925A TWI314691B (en) 2006-08-07 2006-08-07 A method for string linking

Country Status (1)

Country Link
TW (1) TWI314691B (en)

Also Published As

Publication number Publication date
TW200809540A (en) 2008-02-16

Similar Documents

Publication Publication Date Title
US11327975B2 (en) Methods and systems for improved entity recognition and insights
Murakami et al. Gapped code clone detection with lightweight source code analysis
JP5768063B2 (en) Matching metadata sources using rules that characterize conformance
Li et al. Efficient shapelet discovery for time series classification
US8527436B2 (en) Automated parsing of e-mail messages
US20090049144A1 (en) Apparatus, method and computer program product for processing email, and apparatus for searching email
WO2021169186A1 (en) Text duplicate checking method, electronic device and computer-readable storage medium
US8918402B2 (en) Method of bibliographic field normalization
Wang A re-examination of dependency path kernels for relation extraction
Branting A comparative evaluation of name-matching algorithms
CN110286934A (en) A kind of inspection method and device of static code
Blohm et al. Harvesting relations from the web-quantifiying the impact of filtering functions
TW200837581A (en) Verifying method for reliability of patent data
TWI314691B (en) A method for string linking
US10511563B2 (en) Hashes of email text
Branting Name-Matching Algorithms for Legal Case-Management Systems', Refereed article
Lee et al. Approximate substring selectivity estimation
He et al. Repair diversification: A new approach for data repairing
WO2019223597A1 (en) Method and device for annotation information determination and prefix tree construction
JP2008090396A (en) Electronic document retrieval method, electronic document retrieval device, and program
Schedl et al. Automatically detecting members and instrumentation of music bands via web content mining
CN110704522B (en) Concept data model automatic conversion method based on semantic analysis
CN110502629B (en) LSH-based connection method for filtering and verifying similarity of character strings
Zhang et al. Smooth q-Gram, and Its Applications to Detection of Overlaps among Long, Error-Prone Sequencing Reads
Liu et al. Duplicate identification in deep web data integration

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees