TW201124860A

TW201124860A - Method and apparatus for identifying synonym, and searching method and apparatus utilizing the same.

Info

Publication number: TW201124860A
Application number: TW99100270A
Authority: TW
Inventors: Jing Dong; Fei Xing; Ning Guo; Lei Hou; Qin Zhang
Original assignee: Alibaba Group Holding Ltd
Priority date: 2010-01-07
Filing date: 2010-01-07
Publication date: 2011-07-16
Also published as: TWI471739B

Abstract

This invention discloses a method and apparatus for identifying synonyms, and a search method and apparatus that use them. The disclosed method includes: (a) obtaining any two Chinese words to be identified; (b) determining that the minimum edit distance between the two Chinese words is less than or equal to an edit distance threshold, and then executing step (c); (c) determining whether both Chinese words exist in a preset knowledge base and, if so, looking up in the knowledge base the smallest-granularity type with the highest weight value for each Chinese word; and (d) if the two Chinese words have the same smallest-granularity type with the highest weight value, determining that the two Chinese words are synonyms, and otherwise that they are non-synonyms. The disclosed techniques greatly improve the accuracy of synonym identification and make the identification results dependable.

Description

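VI. Description of the Invention

[Technical Field]

The present application relates to the field of computer data processing, and in particular to a method and apparatus for identifying Chinese synonyms and to a search method and apparatus that use them.

[Prior Art]

Existing search is generally keyword-based: the user enters keywords, the search engine runs the query, and the engine returns result pages that contain those keywords. For example, when a user enters "數位照相機" (digital camera, long form), an existing Chinese search engine first segments the input, usually into the two terms "數位 | 照相機" ("digital" and "camera"), and the returned result pages contain both "數位" and "照相機".

In practice, users differ in background and habit, so users with similar intent may well express it with different query keywords. For example, users who query "數位照相機" and users who query "數位相機" (digital camera, short form) have exactly the same underlying intent. Yet for "數位照相機" an existing search engine returns pages containing "數位" and "照相機", while some very valuable pages that contain "數位" and "相機" instead are either not returned or, because of other technical factors, returned but ranked low. If the search engine could recognize that the two words form a synonym pair and merge the result pages for both phrases, search accuracy and the user's search experience would improve considerably.

Synonymy is a distinctive phenomenon of natural language, and synonym mining is a meaningful task in natural language processing. It greatly helps search query rewriting and the enrichment of search results, so that users get a good query experience. Synonym substitution in search applications, however, must be applied judiciously and cannot be solved simply by using any near-synonym list. Users are accustomed to keyword search and to seeing the characters and words that match their query highlighted in red in the result entries, so even substituting a fully synonymous but different word is not something every user will accept. For example, "土豆" and "馬鈴薯" (both meaning "potato") are fully synonymous, but if a user enters "土豆" and "馬鈴薯" appears highlighted in red in the results, the user may at first glance think the search engine has malfunctioned; if it is not highlighted, the user's eye easily skips over it. The synonyms addressed in this document are therefore synonyms that are suitable for search applications.

One existing method for automatically identifying Chinese synonyms represents each word as a web page. The other words that a dictionary uses to explain a word form link relationships with that word, and each word is given a score that represents the similarity between words. In other words, the relationship between explaining and explained vocabulary items is treated as a hyperlink, the PageRank value is treated as a measure of the semantic similarity between vocabulary items, and synonyms are identified according to the magnitude of that similarity. This method relies mainly on the PageRank value as the indicator of synonymy, and the PageRank value depends on the resources available, which are highly arbitrary and hard to control. For example, if the resource used to explain "土豆" emphasizes the botanical properties and shape of the potato, then "土豆" may well end up linked as a near-synonym to words such as "根莖" (rhizome) and "橢圓" (ellipse). A PageRank value that reflects such link relationships is therefore very unreliable, and this unreliable information is hard to detect automatically, so the required synonyms cannot be identified accurately and the identification quality is hard to guarantee.

[Summary of the Invention]

In one aspect, embodiments of the present application provide a method and apparatus for identifying Chinese synonyms, to solve the problem that the quality of Chinese synonym identification cannot be guaranteed.

In another aspect, embodiments of the present application provide a search method and apparatus, to enrich search result information.

An embodiment of the present application provides a method for identifying Chinese synonyms, comprising:

a. a computing server obtains any two Chinese words to be identified;

b. after it is determined that the minimum edit distance between the two Chinese words is less than or equal to an edit distance threshold, step c is executed;

c. it is determined whether both Chinese words to be identified exist in a preset knowledge base and, if so, the smallest-granularity type with the highest weight is looked up in the knowledge base for each Chinese word;

d. if the highest-weight smallest-granularity types found for the two Chinese words are equal, the two Chinese words are determined to be synonyms; otherwise they are determined to be non-synonyms.

If the two Chinese words to be identified do not both exist in the preset knowledge base, the method further comprises:

e. the computing server segments each Chinese word that cannot be found and then determines whether all the words obtained by segmentation exist in the knowledge base; if so, the highest-weight smallest-granularity type of each Chinese word is looked up in the knowledge base and the subsequent steps continue.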
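Steps c and d above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the knowledge base is assumed to be a dictionary that maps each word to its smallest-granularity types and their weights, and the names `KnowledgeBase`, `top_type`, `compare_top_types` and `TOY_KB` are hypothetical. The weights for "蘋果" (apple) are the example values given later in the description; the weights for the two camera words are made up for illustration. The edit-distance gate of step b is assumed to have passed already.

```python
from typing import Dict, Optional

# Hypothetical layout: word -> {smallest-granularity type: weight}.
KnowledgeBase = Dict[str, Dict[str, float]]


def top_type(word: str, kb: KnowledgeBase) -> Optional[str]:
    """Return the highest-weight type recorded for `word`, or None if the word is absent."""
    types = kb.get(word)
    if not types:
        return None
    return max(types, key=types.get)


def compare_top_types(w1: str, w2: str, kb: KnowledgeBase) -> Optional[bool]:
    """Steps c and d: True for synonyms, False for non-synonyms.

    Returns None when either word is missing from the knowledge base,
    which is the case that step e (segmentation) is meant to handle.
    """
    t1, t2 = top_type(w1, kb), top_type(w2, kb)
    if t1 is None or t2 is None:
        return None
    return t1 == t2


# Toy data; only the apple weights come from the description.
TOY_KB: KnowledgeBase = {
    "數位相機": {"product type-simple": 0.7, "product-brand": 0.3},
    "數位照相機": {"product type-simple": 0.9, "product-brand": 0.1},
    "蘋果": {"product type-simple": 0.38, "plant": 0.54},
}
```

Returning `None` rather than `False` for a missing word keeps "not in the knowledge base" distinct from "different types", so a caller can decide whether to segment the word and retry.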
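After it is determined that the highest-weight smallest-granularity types of the two Chinese words are equal, the method further comprises: the computing server determines whether the character or word that differs between the two Chinese words is one of the changeable characters in a preset general-character table (普義字表); if so, the two Chinese words are determined to be synonyms; otherwise they are determined to be non-synonyms.

The knowledge base comprises terms and concepts. Each term or concept corresponds to at least one type, and each type corresponding to a term or concept has a weight value.

Looking up the highest-weight smallest-granularity type of each Chinese word in the knowledge base comprises: finding in the knowledge base the term or concept that corresponds to each Chinese word and, from the at least one type that corresponds to that term or concept and the associated weight values, finding the highest-weight smallest-granularity type of each Chinese word.

If the two Chinese words are determined to be synonyms, the identified synonyms are stored in a synonym thesaurus.

An embodiment of the present application also provides a search method, comprising: a search engine receives a query request from a user, the query request containing a query term; the search engine looks up the query term in a preset synonym thesaurus and finds the synonyms of the query term; the search engine searches with the query term and its synonyms, and returns to the user search results that include the query term and its synonyms.

An embodiment of the present application also provides an apparatus for identifying Chinese synonyms, comprising:

an obtaining unit, configured to obtain any two Chinese words to be identified;

a first determining unit, configured to notify a second determining unit after determining that the minimum edit distance between the two Chinese words is less than or equal to the edit distance threshold;

the second determining unit, configured to notify a lookup unit when it determines that both Chinese words to be identified exist in the preset knowledge base;

the lookup unit, configured to look up in the knowledge base the highest-weight smallest-granularity type of each Chinese word;

a third determining unit, configured to determine that the two Chinese words are synonyms when the types found are equal, and that they are non-synonyms when the types found are not equal.

The apparatus further comprises a segmentation unit, configured to segment Chinese words that cannot be found in the knowledge base and then notify the second determining unit. The second determining unit is further configured to notify the lookup unit when all the words obtained by segmentation exist in the knowledge base, and to notify the segmentation unit again when they do not all exist in the knowledge base.

The apparatus further comprises a general-character table lookup unit, configured to notify the third determining unit to determine that the two Chinese words are synonyms when the character or word that differs between them is a changeable character in the preset general-character table, and to determine that they are non-synonyms when it is not.

The knowledge base comprises terms and concepts; each term or concept corresponds to at least one type, and each type corresponding to a term or concept has a weight value. The apparatus for identifying Chinese synonyms is a computing server or a search engine.

An embodiment of the present application also provides a search apparatus, comprising: a receiving unit, configured to receive a query request from a user, the query request containing a query term; a synonym lookup unit, configured to look up the query term in a preset synonym thesaurus and find its synonyms; a search unit, configured to search with the query term and its synonyms; and a feedback unit, configured to return the search results to the user.

With the method and apparatus for identifying Chinese synonyms provided by the embodiments, the minimum edit distance between the Chinese words to be identified is checked first, so the wording of a synonym pair differs only slightly. In search applications this can improve the accuracy of search results without jarring the user. In addition, the embodiments use the knowledge base to verify the semantics of the words to be identified, which greatly increases the accuracy of the identified synonyms and ensures the identification quality. The search method and apparatus provided by the embodiments both avoid jarring the user during search and enrich the search results, so that the returned results better match the user's needs.

[Embodiments]

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present application.

The present application uses a minimum edit distance algorithm. By requiring a small edit distance, the wording of a synonym pair differs only slightly, so the user is not jarred in search applications. In addition, shallow semantic concept verification against the knowledge base greatly increases synonym accuracy. The extracted synonym table works well for search and related applications and can of course also be used in fields other than search.

The edit distance is the minimum number of basic operations needed to change one string into another, and can also be understood as the sum of the lengths of the parts in which the two strings differ. The usual basic operations are inserting a character or word, deleting a character or word, replacing a character or word, or other operations defined as needed. For example, changing "我愛你" ("I love you") into "我不愛她" ("I do not love her") requires at least two basic operations, inserting "不" and replacing "你" with "她", so the edit distance between the two is 2. Likewise, the edit distance between "隱形的翅膀" ("invisible wings") and "好吃的雞翅膀" ("tasty chicken wings") is 3. The computation of the edit distance, that is, inserting, deleting and replacing a character or word, is entirely prior art.

Figure 1 is a flowchart of identifying Chinese synonyms according to one embodiment of the present application. The purpose of this embodiment is to identify whether two Chinese words are synonyms. The steps are as follows.

Step 101: the computing server obtains any two Chinese words to be identified.

The two Chinese words are usually taken from the search engine's query log. To improve efficiency, the 100,000 most frequently entered terms can be selected from the engine's query log and compared with one another pairwise. The computing server may be the search engine itself, a server dedicated to synonym comparison, or another server with computing capability.

Step 102: after the computing server determines that the minimum edit distance between the two Chinese words is less than or equal to the edit distance threshold, step 103 is executed; if the minimum edit distance is greater than the threshold, the two Chinese words are directly determined to be non-synonyms.

The edit distance threshold may be 1, 2, 3, and so on. The smaller the edit distance, the smaller the change between the two words.

Step 103: the computing server determines whether both Chinese words exist in the preset knowledge base; if so, step 104 is executed. The knowledge base is described in detail later.

Step 104: the computing server looks up in the knowledge base the highest-weight smallest-granularity type of each Chinese word. The weights and granularity types of Chinese words in the knowledge base are described later.

Step 105: if the highest-weight smallest-granularity types found for the two Chinese words are equal, the computing server determines that the two words are synonyms; otherwise it determines that they are non-synonyms.

If the two Chinese words do not both exist in the preset knowledge base, the method further comprises:

Step 106: the computing server segments each Chinese word that cannot be found and then determines whether all the words obtained by segmentation exist in the knowledge base; if so, the flow continues with step 104; otherwise step 106 is executed again.

This embodiment may further comprise storing the identified synonyms in a synonym thesaurus for later use. The thesaurus may store the identified synonyms in the form of a data table. One possible implementation stores terms that are synonyms of one another in corresponding rows, which makes lookup convenient. For example, if terms A1 and A2, terms B1 and B2, and terms C1, C2 and C3 are synonyms of one another, they can be stored as shown in Table 1.

Table 1

No.   Query term   Synonym 1   Synonym 2
1     A1           A2
2     A2           A1
3     B1           B2
4     B2           B1
5     C1           C2          C3
6     C2           C1          C3
7     C3           C1          C2

The data table that stores synonyms is of course not limited to the form shown in Table 1. This document does not limit the specific storage form of synonyms, as long as the synonyms of a query term can be found promptly.

With the method for identifying Chinese synonyms provided by this embodiment, the minimum edit distance between the Chinese words to be identified is checked first, so the wording of a synonym pair differs only slightly and the user is not jarred in search applications. In addition, the embodiment uses the knowledge base to verify the semantics of the words to be identified, which greatly increases the accuracy of the identified synonyms and ensures the identification quality. The Chinese synonyms determined by the present application can be applied not only to search-related fields but also to other fields.

Figure 2 is a flowchart according to a preferred embodiment of the present application. The details are as follows.

Step 201: the computing server obtains any two Chinese words to be identified.

The word pairs selected for identification are generally high-frequency words that occur often in the log, for example words that occur 20 or more times. High-frequency words are representative, and repeated occurrence ensures that a word is not a rare word. In addition, the number of characters in a word should preferably not exceed a certain limit, for example 8 characters or fewer. This allows the edit distance to be computed quickly later, and longer terms are less likely to have synonyms. Here, the two Chinese words to be identified come from the search engine's query log.

Step 202: the computing server computes the minimum edit distance between the two Chinese words to be identified.

An existing dynamic programming algorithm can be used to compute the minimum edit distance between the two Chinese words. Other algorithms can of course also be used; the specific algorithm for computing the minimum edit distance is not limited here. In the existing dynamic programming algorithm, the smallest unit of each word is one character. For example, for two words W1 and W2 composed of the characters c1c2c3 and d1d2d3, the shortest distance Dis(c1c2c3, d1d2d3) can be obtained from the shortest distances of their substrings. Specifically, if c3 = d3, then Dis(c1c2c3, d1d2d3) = Dis(c1c2, d1d2) + 1; if they are not equal, then Dis(c1c2c3, d1d2d3) = Max(Dis(c1c2, d1d2d3), Dis(c1c2c3, d1d2)), where Max selects the larger of the two values. This is the dynamic programming algorithm.

Step 203: the computing server determines whether the computed minimum edit distance is less than or equal to the edit distance threshold; if so, step 204 is executed; otherwise the two Chinese words are determined to be non-synonyms.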
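The edit-distance gate of steps 202 and 203 can be illustrated with a short Python sketch. One caveat: the recurrence as printed in the description (add 1 when the last characters match, otherwise take the larger of two sub-results) resembles the longest-common-subsequence recurrence rather than a distance. The sketch below therefore substitutes the standard Levenshtein dynamic program, with insert, delete and replace each costing 1, which is an assumption on our part but does reproduce the worked examples given earlier in the description (distances 2 and 3). The names `edit_distance` and `passes_edit_distance_gate` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between two words."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # replace ca with cb, or keep it if equal
            ))
        prev = curr
    return prev[-1]


def passes_edit_distance_gate(a: str, b: str, threshold: int = 1) -> bool:
    """Step 203: continue only if the distance does not exceed the threshold."""
    return edit_distance(a, b) <= threshold
```

For instance, `passes_edit_distance_gate("數位照相機", "數位相機")` holds with a threshold of 1, because the two words differ by a single deleted character. Keeping only two rows of the table is enough here, since the description limits words to about 8 characters and only the final distance is needed.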
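In a preferred embodiment, the edit distance threshold is set to 1. The synonyms identified in this embodiment are mainly used in search applications that rewrite queries. Put simply, query rewriting replaces the keyword entered by the user with its synonym when searching, which increases recall and enriches the result set. With a segmentation-based search engine, a user who searches for "嬰幼兒奶粉" (milk powder for infants and toddlers) will not find results that contain "嬰兒奶粉" (infant milk powder), but once "嬰幼兒" is replaced by its synonym "嬰兒" those results can be found. Consequently, if the rewrite changes several characters and the changed text is nevertheless highlighted in red, the user experience is at risk even when the rewritten query differs little in meaning from the original, because users are used to seeing their own keywords highlighted in the result entries and a larger change may not suit some of them. An edit distance threshold of 1 is therefore recommended here; that is, if two Chinese words are synonyms, they differ very little in surface form.

Step 204: the computing server determines whether both Chinese words to be identified exist in the preset knowledge base; if so, step 205 is executed; otherwise step 208 is executed.

The knowledge base is in practice a dictionary file and can also be called a concept base. It consists of terms and concepts. A term can be understood as a basic word, and a concept as a combination of terms, but one that is a very fixed combination commonly used in daily life. For example, "蘋果" (apple), "北京" (Beijing) and "大學" (university) are each a term, while "北京大學" (Peking University) is a concept in the knowledge base.

The knowledge base itself is a data table. Each entry represents a word and consists of several fields: the word itself, the type of the word, and the weight of the type.

The knowledge base has at least one predefined type, and usually several dozen. The types are hierarchical, and each level corresponds to a granularity. Because the types are divided into several levels, the different levels correspond to several granularities, and a type at a given level of granularity is referred to here as a granularity type. A type is a predefined attribute. These attributes are defined with reference to linguistic knowledge, and every word is assigned in advance to the types it belongs to.

For example, Figure 3 is a schematic diagram of the knowledge base type hierarchy according to an embodiment of the present application. In this embodiment, "product" is a higher-level type, referred to here as a first-level type. "Product-brand", "product-model", "product-specification" and "product-type" are different types under "product"; that is, they are second-level types located under the first-level type. Under "product-type" there can further be the third-level types "product type-simple", "product type-compound" and "product type-generic". In this embodiment, the third-level types "product type-simple", "product type-compound" and "product type-generic" are the smallest-granularity types.

Each term or concept in the knowledge base corresponds to at least one level type. For example, "蘋果" (apple) belongs not only to "product type-simple" but also to the type "plant", while "汽車" (car) belongs only to "product type-generic". In addition, each term or concept has weight values, each indicating the probability that the term or concept belongs to the given type. For example, the weight of "蘋果" for "product type-simple" is 0.38 and its weight for "plant" is 0.54.

The types in the knowledge base, the levels of the types, and the weight of each type for a given term or concept are obtained through accumulated experience. That is, the types and their levels are defined with reference to linguistic knowledge, while the weights of each word are counted from web page resources. For example, if the word "蘋果" appears in web pages 60 times in the sense of a computer product and 40 times in the sense of a plant, its weights for "product type-simple" and "plant" are 0.6 and 0.4 respectively.

Step 205: the computing server looks up in the knowledge base the highest-weight smallest-granularity type of each Chinese word.

Because each term or concept in the knowledge base corresponds to at least one type and has weight values, the highest-weight smallest-granularity type of each Chinese word can be found.

Step 206: the computing server determines whether the highest-weight smallest-granularity types found for the two Chinese words are equal; if so, step 207 is executed; otherwise the two Chinese words are determined to be non-synonyms.

Using the smallest-granularity type constrains the semantic attributes of the words to be identified more strictly and so ensures the reliability of the identified synonyms.

Step 207: the computing server determines whether the character or word that differs between the two Chinese words is a changeable character in the preset general-character table; if so, the two Chinese words are determined to be synonyms; otherwise they are determined to be non-synonyms.

Like the knowledge base, the general-character table is a text file, with each line representing one general character. The table has two parts: characters that may change and characters that may not change. Changeable characters are mostly suffix characters of multi-character words and occur very frequently, for example "機" (machine) and "器" (device). Unchangeable characters are mostly prefix or suffix characters of words that usually alter the meaning of the word, for example "不" (not), "非" (non-) and "半" (half). The general-character table is likewise obtained through accumulated experience or manual review.

The general-character table further ensures the quality of synonym identification.

Step 208: the computing server segments the Chinese words that cannot be found.

The Chinese words that cannot be found may be both of the two Chinese words to be identified, either one of them, or words that were themselves obtained by earlier segmentation.

Step 209: the computing server determines whether all the words obtained by segmentation exist in the knowledge base; if so, the flow returns to step 205; otherwise step 208 is executed again.

The above method for identifying Chinese synonyms can be applied in a search engine, or in other servers or devices that need it.

This embodiment may further comprise storing the identified synonyms in a synonym thesaurus for later use. The thesaurus may store the identified synonyms in the form of a data table. One possible implementation stores terms that are synonyms of each other in one-to-one correspondence, which makes lookup convenient. This document does not limit the specific storage form of synonyms, as long as the synonyms of a query term can be found promptly.

With the method for identifying Chinese synonyms provided by the embodiment of Figure 2, the minimum edit distance between the Chinese words to be identified is checked first, so the wording of a synonym pair differs only slightly and the user is not jarred in search applications. In addition, the embodiment uses the knowledge base to verify the semantics of the words to be identified; the smallest-granularity type constrains their semantic attributes more strictly, which greatly increases the accuracy of the identified synonyms. Furthermore, the general-character table is used to verify once more the characters that differ within the word pair, which further ensures the identification quality.

The resulting synonym thesaurus can be applied as follows. When the search engine receives a query term entered by the user, it finds the synonyms of the query term in the thesaurus. The search engine then runs separate searches with the user's query term and with its synonyms, and returns the results of both searches to the user. This both avoids jarring the user and enriches the search results, so that the returned results better match the user's needs; it can therefore be used in the query rewriting function of a search engine. For example, when a user queries "數位照相機" through the search engine, the engine learns from the thesaurus that "數位相機" is a synonym of "數位照相機", runs separate searches with "數位相機" and "數位照相機", and returns search results that contain either. This not only enriches the search results but also avoids missing information that the user needs.

On the basis of this application, the present application also provides a search method. Referring to Figure 6, it comprises:

Step 601: the user enters a query term and submits a query request to the search engine.

Step 602: after receiving the user's query request containing the query term, the search engine looks up the query term in the preset synonym thesaurus and finds the synonyms of the query term.

Step 603: the search engine searches with the query term and its synonyms.

Step 604: the search engine returns to the user search results that include the query term and its synonyms.

An embodiment of the present application also provides an apparatus for identifying Chinese synonyms. Referring to Figure 4, it comprises an obtaining unit 401, a first determining unit 402, a second determining unit 403, a lookup unit 404 and a third determining unit 405.

The obtaining unit 401 is configured to obtain any two Chinese words to be identified. In a search engine application, the query log stores the keywords that users frequently query, so of the two Chinese words obtained, one comes from the keyword the user entered in the search engine and the other is obtained from the query log on the basis of that keyword.

The first determining unit 402 is configured to notify the second determining unit 403 after determining that the minimum edit distance between the two Chinese words is less than or equal to the edit distance threshold.

The second determining unit 403 is configured to notify the lookup unit 404 when it determines that both Chinese words to be identified exist in the preset knowledge base.

The lookup unit 404 is configured to look up in the knowledge base the highest-weight smallest-granularity type of each Chinese word.

The third determining unit 405 is configured to determine that the two Chinese words are synonyms when the types found are equal, and that they are non-synonyms when the types found are not equal.

The apparatus may further comprise a segmentation unit 406, configured to segment Chinese words that cannot be found in the knowledge base and then notify the second determining unit 403. The second determining unit 403 is further configured to notify the lookup unit 404 when all the words obtained by segmentation exist in the knowledge base, and to notify the segmentation unit 406 again when they do not all exist in the knowledge base.

The apparatus may further comprise a general-character table lookup unit 407, configured to notify the third determining unit 405 to determine that the two Chinese words are synonyms when the character or word that differs between them is a changeable character in the preset general-character table, and to determine that they are non-synonyms when it is not.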
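The check performed in step 207, and by the general-character table lookup unit 407, can be sketched as follows. This is a minimal illustration under stated assumptions: the changed part of each word is taken to be whatever remains after the common prefix and common suffix are stripped, which fits the recommended edit distance threshold of 1; the table contents are just the example characters named in the description; and a differing character that appears in neither part of the table is treated as not changeable, a case the description does not address. The names `CHANGEABLE`, `differing_chars` and `passes_general_character_check` are hypothetical.

```python
# Example changeable characters taken from the description ("機", "器").
# The unchangeable examples ("不", "非", "半") need no separate set here,
# because anything outside CHANGEABLE is rejected.
CHANGEABLE = {"機", "器"}


def differing_chars(a: str, b: str) -> str:
    """Return the characters left in both words after removing the common prefix and suffix."""
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    j = 0
    # Do not let the common suffix overlap the prefix that was already matched.
    while j < len(a) - i and j < len(b) - i and a[len(a) - 1 - j] == b[len(b) - 1 - j]:
        j += 1
    return a[i:len(a) - j] + b[i:len(b) - j]


def passes_general_character_check(a: str, b: str, changeable=CHANGEABLE) -> bool:
    """Step 207: accept the pair only if every differing character may change."""
    return all(ch in changeable for ch in differing_chars(a, b))
```

With this toy table, a pair that differs only in a trailing "機" passes, while a pair that differs in a leading "非" is rejected, which matches the intent that negating prefixes must not be treated as harmless variation.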
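The knowledge base comprises terms and concepts; each term or concept corresponds to at least one type, and each type corresponding to a term or concept has a weight value. Both the knowledge base and the general-character table are obtained through accumulated experience.

The apparatus for identifying Chinese synonyms may exist on its own as a computing server, or as part of a search engine, or as part of another server.

With the apparatus for identifying Chinese synonyms provided by this embodiment, the minimum edit distance between the Chinese words to be identified is checked first, so the wording of a synonym pair differs only slightly and the user is not jarred in search applications. In addition, the embodiment uses the knowledge base to verify the semantics of the words to be identified; the smallest-granularity type constrains their semantic attributes more strictly, which greatly increases the accuracy of the identified synonyms. Furthermore, the general-character table is used to verify once more the characters that differ within the word pair, which further ensures the identification quality.

An embodiment of the present application also provides a system for identifying Chinese synonyms in a search engine. Referring to Figure 5, it comprises an apparatus 501 for identifying Chinese synonyms and a knowledge base storage device 502.

The knowledge base storage device 502 is configured to store each word itself, the types of the word, and the weight values of the types.

The apparatus 501 for identifying Chinese synonyms is configured to obtain any two Chinese words to be identified; after determining that the minimum edit distance between the two Chinese words is less than or equal to the edit distance threshold, and when both words exist in the preset knowledge base storage device 502, to look up in the knowledge base storage device 502 the highest-weight smallest-granularity type of each Chinese word; to determine that the two Chinese words are synonyms when the types found are equal; and to determine that they are non-synonyms when the types found are not equal.

The apparatus 501 is further configured to segment Chinese words that cannot be found and then determine whether all the words obtained by segmentation exist in the knowledge base storage device 502; if so, to look up in the knowledge base storage device 502 the highest-weight smallest-granularity type of each Chinese word and continue with the subsequent steps; otherwise, to execute this step again.

The system further comprises a general-character table storage device 503, configured to store the characters that may change and the characters that may not change. The apparatus 501 is further configured to determine that the two Chinese words are synonyms when the character or word that differs between them is a changeable character in the preset general-character table, and that they are non-synonyms when it is not.

The system for identifying Chinese synonyms may exist on its own as a server, or as part of a search engine, or as part of another server.

With the system for identifying Chinese synonyms provided by this embodiment, the minimum edit distance between the Chinese words to be identified is checked first, so the wording of a synonym pair differs only slightly and the user is not jarred in search applications. In addition, the embodiment uses the knowledge base to verify the semantics of the words to be identified; the smallest-granularity type constrains their semantic attributes more strictly, which greatly increases the accuracy of the identified synonyms. Furthermore, the general-character table is used to verify once more the characters that differ within the word pair, which further ensures the identification quality.

The present application also provides a search apparatus. Referring to Figure 7, it comprises a receiving unit 701, a synonym lookup unit 702, a search unit 703 and a feedback unit 704.

The receiving unit 701 is configured to receive a query request from a user, the query request containing a query term.

The synonym lookup unit 702 is configured to look up the query term in a preset synonym thesaurus and find the synonyms of the query term.

The search unit 703 is configured to search with the query term and its synonyms.

The feedback unit 704 is configured to return the search results to the user.

The search apparatus provided by this embodiment both avoids jarring the user during search and enriches the search results, so that the returned results better match the user's needs.

It should be noted that the present application describes the identification of synonyms using Chinese only as an example and is not limited to Chinese. For Japanese, Korean and other written languages, synonyms can likewise be identified with the method described here, or with slight modifications, equivalent substitutions or improvements based on it.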
在本文中，諸如第一和第二等之類的關係術語僅僅用來將一個實體或者操作與另一個實體或操作區分開來，而不一定要求或者暗示這些實體或操作之間存在任何這種實際的關係或者順序。而且’術語“包括”、“包含”或者其任何其他變體意在涵蓋非排他性的包含，從而使得包括一系列要素的過程、方法、物品或者設備不僅包括那些要素，而且還包括沒有明確列出的其他要素，或者是還包括爲這種過程、方法、物品或者設備所固有的要素。爲了描述的方便，以上所述裝置和或系統的各部分以功能分爲各種單元分別描述。當然，在實施本申請時可以把各單元的功能在同一個或多個軟體或硬體中實現。本領域普通技術人員可以理解實現上述方法實施方式中的全部或部分步驟是可以藉由程式來指令相關的硬體來完成’所述的程式可以儲存於電腦可讀取儲存媒體中，這裏所稱的儲存媒體，如：ROM/RAM、磁碟、光碟等。以上所述僅爲本申請的較佳實施例而已，並非用於限定本申請的保護範圍。凡在本申請的精神和原則之內所作的任何修改、等同替換、改進等，均包含在本申請的保護範圍內。【圖式簡單說明】爲了更清楚地說明本申請實施例或現有技術中的技術方案，下面將對實施例或現有技術描述中所需要使用的圖式作簡單地介紹，顯而易見地，下面描述中的圖式僅僅是 -26- 201124860 本申請的一些實施例，對於本領域普通技術人員來講’在不付出創造性勞動性的前提下，還可以根據這些圖式獲得其他的圖式。圖1是根據本申請一個實施例的識別中文同義詞的流程圖；圖2是根據本申請的一較佳實施例的流程圖；圖3是根據本申請實施例的知識庫類型層次示意圖；圖4是根據本申請一個實施例的識別中文同義詞的裝置結構示意圖；圖5是根據本申請一個實施例的識別中文同義詞的系統結構示意圖。圖6是根據本申請一個實施例的一種搜索方法的流程圖；圖7是根據本申請一個實施例的一種搜索裝置的結構示意圖。【主要元件符號說明】 401 :獲取單元 402 :第一判斷單元 403 :第二判斷單元 4 0 4 :查詢單元 405 :第三判斷單元 406 :分詞單元 407:普義字表査詢單元 -27- 201124860 501 :識別中文同義詞的裝置 5 02 :知識庫儲存單元裝置 503 :普義字表儲存裝置 701 :接收單元 702 :同義詞查詢單元 703 :搜索單元 704 :回饋單元 -28201124860 VI. Description of the Invention: [Technical Field] The present application relates to the field of computer data processing technology, and in particular, to a method and apparatus for identifying Chinese synonyms and a method and apparatus for searching therewith. [Prior Art] Existing searches are generally keyword-based searches, in which a user enters a keyword for the search engine to query, and the search engine returns a result page containing the keywords. For example, if the user inputs “digital camera”, the existing Chinese search engine will first segment the input keyword, and usually divide the “digital camera” into two words “digital camera”, and then return the result page containing &quot "Digital" and "camera" are two terms. In fact, different users have different backgrounds and different habits. It is very likely that they have similar intentions and the keywords used for querying are different. 
For example, the query " The potential intentions of users of digital cameras and digital cameras are exactly the same. For "digital cameras", the results of existing search engines return pages containing "digital" and "camera" entries, and some A very expensive result page, because the words "digits" and "cameras" are not returned or returned due to other technical factors but not in a very high position. If the search engine can find the pair is a group Synonym, at the same time returning the results page of two phrases, then to improve the accuracy of the search, And the user search experience is very effective. -5- 201124860 Synonym is a unique phenomenon in natural language. Synonym control is also a very meaningful work in natural language processing. Its implementation rewrites search results for rich query results. In order to make the user get a good query experience, it is very helpful. However, the synonym replacement involved in the search application must be properly grasped, and it can be solved without using any synonym table. Because the user is used to keyword search, It is accustomed to input the query 'the same word and word in the query are marked red in the result entry; then even the completely synonymous different words and word substitutions are not acceptable to every user. For example: "Potato" and "Potato" It is completely synonymous, but the user enters "potato" and the "potato" appears to be marked red in the result entry. At first glance, it thinks that the search engine has a problem. If it is not marked red, it is easy to be jumped by the user's eyes. So the synonym involved in this article refers to the same thing that should be suitable for search applications. 
The existing automatic recognition method for Chinese synonyms is to represent each word as a web page, and other words in the dictionary that explain the word form a chain relationship with the word, giving each word a branch, this point値 represents the similarity between words, that is, the relationship between interpretation and interpretation is defined as a super-link, and the page rank (PageRank) is regarded as reflecting the semantics between words. The measure of similarity, and then the synonym is identified according to the size of the semantic similarity. This method is mainly used to measure the synonym by PageRank ,, and PageRank値 is indeed dependent on the resources it can obtain, and this resource There is a great deal of randomness that is difficult to control. For example, for the explanation of "potato", if the resources used focus on explaining the plant characteristics, shape characteristics, etc. of the potatoes, then it is very likely that "potato" -6- 201124860 and "roots" , "ellipse" and other words establish a synonym relationship. Therefore, this PageRank値 which reflects the link relationship is very unreliable, and such unreliable information is difficult to automatically detect, resulting in the inability to accurately identify the required synonym' so that the recognition effect is difficult to guarantee. SUMMARY OF THE INVENTION An embodiment of the present application is to provide a method and apparatus for identifying a Chinese synonym to solve the problem that the Chinese synonym recognition effect cannot be guaranteed. ** Another aspect of the present application is to provide a search method and apparatus. Enrich search results information. An embodiment of the present application provides a method for identifying a Chinese synonym, including: a. calculating a server to obtain any two Chinese words that need to be recognized; b. 
determining that a minimum edit distance between the two Chinese words is less than or equal to an edit distance threshold. After the step c: c, it is determined whether the two Chinese words that need to be identified exist in the preset knowledge base, and if so, the minimum weight of each Chinese word is found in the knowledge base. Type; d. If the minimum weighted type of each Chinese word that is queried is equal, then the two Chinese words are determined to be synonymous, otherwise the two Chinese words are determined to be non-synonymous. If the two Chinese words that need to be identified do not exist in the preset knowledge base, then further include: 201124860 e, the computing server classifies the Chinese words that cannot be found, and then judges the Chinese after the word segmentation Whether the words are present in the knowledge base, and if so, respectively searching for the smallest granularity type with the largest weight of each Chinese word in the knowledge base, and continuing the subsequent steps e » wherein, when determining each Chinese word After the equal weighted minimum granularity types are equal, the method further includes: calculating, by the server, whether the changed word or word in the two Chinese words belongs to a word that can be changed in the set meaning table, and if so, determining the need The two Chinese words identified are synonymous, otherwise the two Chinese words are determined to be non-synonymous. The knowledge base includes: terms and mourning, each vocabulary or mourning corresponds to at least one type, and each type corresponding to each vocabulary or mourning has a weight 値. 
The minimum granularity type in which the weight of each Chinese word is found in the knowledge base is the largest: the words or mourning corresponding to each Chinese word are found in the knowledge base, according to each word The bar or mourning corresponds to at least one type, and the weight of each term or mourning, and finds the smallest granularity type with the largest weight of each Chinese word. If the two Chinese words are determined to be synonymous, the identified synonyms are stored in the thesaurus. The embodiment of the present application further provides a search method, including: the search engine receives a query request from a user, The query request includes the to-be-queried entry; -8 - 201124860 The search engine queries the pre-set synonyms database according to the to-be-queried term to find a synonym of the to-be-queried term; the search engine applies the to-be-queried entry and the to-be-queried The synonym of the query term is searched, and the search result including the query to be queried and the synonym of the query to be queried is returned to the user. 
The embodiment of the present application further provides an apparatus for identifying a Chinese synonym, comprising: an obtaining unit, configured to obtain any two Chinese words that need to be identified; and a first determining unit, configured to determine a minimum edit between the two Chinese words After the distance is less than or equal to the edit distance, the second determining unit is notified: the second determining unit is configured to notify the query unit when the two Chinese words that need to be identified are present in the preset knowledge base; a minimum granularity type for finding the weight of each Chinese word in the knowledge base respectively; a third determining unit, configured to determine that the minimum granularity type of each of the Chinese words that are queried is the largest The two Chinese words are synonymous 'determine that the minimum weighted type of each Chinese word that is queried is not equal.' The two Chinese words are determined to be non-synonymous. The device further includes: a word segmentation unit, configured to segment the Chinese words that cannot be found in the knowledge base, and then notify the second determining unit; the second determining unit is further configured to determine the When the Chinese words after the word segmentation are present in the knowledge base, the query unit is notified, and when it is determined that the Chinese words after the word segmentation are not all present in the knowledge base, the word unit is notified again -9 - 201124860. 
The device further includes: a general meaning table query unit, configured to determine that the changed word or word in the two Chinese words belongs to a word that can be changed in the set meaning table, and notify the third determining unit to determine The two Chinese words are synonymous, and when it is determined that the changed words or words in the two Chinese words do not belong to the words that can be changed in the set meaning table, the third determining unit is notified that the two Chinese words are Non-synonymous. The knowledge base includes: terms and mourning, each vocabulary or mourning corresponds to at least one type, and each type corresponding to each vocabulary or mourning has a weight 値. The device for identifying a Chinese synonym is a computing server or a search engine. The embodiment of the present application further provides a search device, including: a receiving unit, configured to receive a query request from a user, where the query request includes a query to be queried; a synonym query unit, configured to query according to the query to be queried a synonym of the pre-set query to find a synonym of the to-be-queried term; a search unit for applying the synonym of the to-be-queried term and the to-be-queried term for searching; and a feedback unit for returning the search result to user. The method and apparatus for identifying Chinese synonyms are provided by the embodiment of the present application. Since the minimum editing distance before the Chinese word to be recognized is first determined, the word expression difference between the synonym pairs is not large, and the search result can be improved in the search application. 
Accuracy, and will not bring a sudden to the user -10- 201124860 Feeling 'again, the embodiment of the present application uses the knowledge base to verify the semantics of the recognized Chinese words, so that the accuracy of the identified synonyms is greatly improved, and the guarantee The recognition effect of synonyms. The search method and device provided by the embodiments of the present application not only avoid the awkward feeling of the user in the search, but also enrich the search result, so that the returned search result is more in line with the user's needs. The technical solutions in the embodiments of the present application are clearly and completely described in the following with reference to the drawings in the embodiments of the present application. It is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of them. Example. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative efforts are within the scope of the present application. The application uses the algorithm of the minimum edit distance, and uses the definition of the smaller edit distance, so that the word expressions between the synonym pairs are not much different, and the user does not feel a sudden feeling in the search application, and also utilizes the knowledge base. The shallow semantic sacred verification makes the synonym accuracy greatly improved, and the extracted synonym table has a good effect on related applications such as search, and can of course be applied to other fields than search. The edit distance refers to the minimum number of basic operations required to change from one string to another, or the sum of the lengths of the difference between the two strings. Common basic operations include inserting a word/word, deleting a word/word, replacing a word/word, or other operations as needed. 
-11 - 201124860 For example, changing from "I love you" to "I don't love her" requires at least one basic operation to insert "No" and "You" with "She", so the editing distance between the two is 2, For the same reason, the edit distance of "invisible wings" and "good chicken wings" is 3. The process of calculating the edit distance, that is, inserting a word/word, deleting a word/word and replacing a word/word is entirely a prior art. Referring to Figure 1, there is shown a flow chart for identifying Chinese synonyms in accordance with one embodiment of the present application. The purpose of this embodiment is to identify whether the two Chinese words to be recognized are synonymous. The specific steps are as follows: Step 101: The computing server obtains any two Chinese words that need to be recognized; here, usually from a search engine. In order to improve efficiency, you can select the 10000 entries with the most input times in the engine's query log, and compare the 100,000 entries one by one. . The above computing server may be the search engine itself, a server that is synonymous with a specific user, or other servers with computing functions. Step 102: After the calculation server determines that the minimum edit distance between two Chinese words that need to be identified is less than or equal to the edit distance threshold, step 103 is performed; if the minimum edit distance between two Chinese words that need to be identified is greater than the edit distance 闽値, directly determine that two Chinese words are non-synonyms. Here, the edit distance threshold 値 may be 1, 2, 3, or the like. It can be understood that the smaller the edit distance of -12-201124860, the smaller the change between the two words. Step 1 〇 3 ’ The calculation server determines whether the two Chinese words that need to be identified exist in the preset knowledge base, and if so, executes step 104 ♦ wherein the content of the knowledge base is specifically described later. 
Step 1 04: The computing server searches for the smallest granularity type with the largest weight of each Chinese word in the knowledge base; wherein the weights and granular types of the Chinese words in the knowledge base are introduced later. Step 205: The computing server determines that the two Chinese words are synonymous if the weight of each Chinese word that is queried is the largest and the smallest granularity type is equal, otherwise the two Chinese words are determined to be non-synonyms. It should be noted that if the two Chinese words that need to be identified do not exist in the preset knowledge base, the method further includes: Step 106: The computing server performs word segmentation on the Chinese words that cannot be found, and then determines the word segmentation. Whether the Chinese words are all present in the knowledge base, and if so, the subsequent steps are performed to perform step 104, otherwise step 106 is performed again. It should be noted that the embodiment may further include: storing the recognized synonyms into the thesaurus for later application. The thesaurus can save the recognized synonyms in the form of a data sheet. For a data table that saves synonyms, one possible implementation is to save the terms that are synonymous with each other, which is convenient for querying. For example, the terms A1 and A2, B1 and B2, C1 and C2, C3 are synonymous with each other, and the storage method can be seen in Table 1, -13- 201124860 Table 1 No. query term Synonym 1 Synonym 2 1 A1 A2 2 A2 A1 3 B1 B2 4 B2 B1 5 C1 C2 C3 6 C2 C1 C3 7 C3 C1 C2 Of course, the data sheet for storing synonyms is not limited to the storage method shown in Table 1. 'This article does not limit the specific preservation form of synonyms, as long as it can guarantee the query. Synonyms of the terms can be found in time. 
The method for recognizing a Chinese synonym provided by the embodiment of the present application first determines the minimum edit distance before the Chinese word to be recognized, so that the word expression difference between the synonym pairs is not large, and the user does not bring the difference in the search application. The awkward feeling, in addition, the embodiment of the present application uses the knowledge base to perform semantic verification on the Chinese words to be recognized, so that the accuracy of the recognized synonyms is greatly improved to ensure the recognition effect of the synonyms. Applying the Chinese synonyms identified in this application can be applied not only to search related fields, but also to other fields. Referring to Figure 2', a flow chart in accordance with a preferred embodiment of the present application. The details are as follows: Step 201: The calculation server obtains any two Chinese words that need to be recognized; -14- 201124860 The commonly selected Chinese word pair to be recognized is a high frequency word that often appears in the log, for example, the number of occurrences is greater than or equal to 20 times. The word, because the high-frequency word is very representative, multiple occurrences ensure that the Chinese word to be recognized is not a strange word; in addition, the number of words of the Chinese word is preferably not more than a certain number, such as the number of words is less than or equal to 8; The editing distance can be calculated quickly, and the longer the term is synonymous. Here, the two Chinese words that need to be identified are from the search engine's query. Step 202: Calculate a minimum edit distance between two Chinese words that the server needs to recognize. Here, the existing dynamic programming algorithm can be used to calculate the minimum editing distance between two Chinese words. Of course, other algorithms can also be used. 
Here, the specific algorithm for calculating the minimum editing distance is not limited. In the existing dynamic programming algorithm, the minimum unit of each word is one word, such as two words W1, W2; their words are composed of clc2c3, dld2d3, then the shortest distance Dis between clc2c3 and dld2d3 (clc2c3, dld2d3) ) can be derived from the shortest distance of their substrings. The specific calculation method is if c3=d3, Dis(clc2c3, dld2d3)=DIS ( clc2,dld2) +1 ' If not equal, Dis(clc2c3, dld2d3) = Max (Dis(clc2,dld2d3), Dis(clc2c3,dld2)), where Max is the largest 値, which is the dynamic programming algorithm. Step 2:3, the calculation server determines whether the calculated minimum edit distance is less than or equal to the edit distance 闽値, and if so, executes step 2 04; otherwise, it determines that the two Chinese words that need to be recognized are non-synonyms. -15- 201124860 In a preferred embodiment, the edit distance 闽値 is equal to one. Because the synonym identified in this embodiment is mainly used in the query rewriting search application, the query rewriting is simply to replace the keyword entered by the user with its synonym, which can increase the recall rate of the result. And can enrich the result set. For word-based search engines, such as users searching for "infant milk powder", the result containing "baby milk powder" can not be found, and when "infant child" is replaced by the synonym "baby" Can be found. Therefore, if the word is changed after the query is overwritten, it is marked red, even if the rewritten query is not much different from the original query, this is also risky for the user experience, because the user is more accustomed to the key in the result entry. Words are marked red, and larger changes may make some people uncomfortable. Therefore, the recommended edit distance threshold 这里 is equal to 1, which means that if two Chinese words are synonymous, the two Chinese words have little change in expression. 
Step 204: The computing server determines whether the two Chinese words to be recognized both exist in the preset knowledge base. If so, step 205 is executed; otherwise, step 208 is executed.

The above knowledge base is in fact a dictionary file, which may also be called a phrase library; it consists of terms and phrases. A term can be understood as a basic word, and a phrase can be understood as a combination of terms, but one that is a fixed combination commonly used in daily life. For example, "Apple", "Beijing" and "University" are each a term, and "Peking University" is a phrase in the knowledge base. The knowledge base itself is a data table in which each entry represents a word, and each entry has several fields: the word itself, the type of the word, and the weight of the type.

The above knowledge base has at least one type defined, and usually dozens of types. These types are hierarchical, and each level corresponds to a granularity; that is, because the types are divided into multiple levels, there are multiple granularities corresponding to the different levels. Here, the granularities corresponding to the multiple levels may be referred to as granularity types. The types are predefined attributes, defined with reference to linguistic knowledge, and all words are classified in advance under their respective types. For example, see FIG. 3, which is a schematic diagram of the knowledge base type hierarchy according to an embodiment of the present application. In this embodiment, "product" belongs to a higher-level type, referred to here as the first-level type, and "product-brand", "product-model", "product-specification" and "product-type" are different types under the "product" level; that is, "product-brand", "product-model", "product-specification" and "product-type" are second-level types, located under the first-level type.
Under "product-type" there may further be third-level types such as "product type - simple", "product type - compound" and "product type - collective name". In this embodiment, the third-level types "product type - simple", "product type - compound" and "product type - collective name" are the minimum-granularity types. Each term or phrase in the knowledge base corresponds to at least one type in the hierarchy. For example, "Apple" belongs not only to "product type - simple" but also to the "plant" type, while "car" belongs only to "product type - collective name". Each term or phrase also has weights: a weight indicates the probability that the term or phrase belongs to a given type. For example, the weight with which "Apple" belongs to "product type - simple" is 0.38, and its weight for "plant" is 0.54.

It can be understood that the types, the type hierarchy, and the weight of each type for a given term or phrase in the above knowledge base are obtained through accumulated experience. That is, the types and the type hierarchy are defined with reference to linguistic knowledge, while the weights of each word are counted from web resources. For example, if the word "apple" appears 60 times in web pages in the sense of a computer product and 40 times in the sense of a plant, its weights for "product type - simple" and "plant" are 0.6 and 0.4 respectively.

Step 205: The computing server looks up, in the knowledge base, the highest-weight minimum-granularity type of each Chinese word.

It can be understood that, because each term or phrase in the knowledge base corresponds to at least one type and each term or phrase has weights, the highest-weight minimum-granularity type of each Chinese word can be found.

Step 206: The computing server determines whether the highest-weight minimum-granularity types of the two Chinese words are equal.
If they are equal, step 207 is performed; otherwise, the two Chinese words to be recognized are determined to be non-synonyms.

It can be understood that using the minimum-granularity type constrains the semantic attributes of the Chinese words to be recognized more strictly, which guarantees the reliability of the recognized synonyms.

Step 207: The computing server determines whether the character or word that differs between the two Chinese words belongs to the changeable words in the preset general word list. If so, the two Chinese words to be recognized are determined to be synonyms; otherwise, they are determined to be non-synonyms.

The general word list, like the knowledge base, is a text file in which each line represents one general word. The general word list contains both words that can be changed and words that cannot be changed. The changeable words are mostly the trailing characters of multi-character words, and these trailing characters occur frequently, such as "機" (machine) and "器" (device). The unchangeable words are mostly the first or last character of a word, and most of them alter the meaning, such as "no", "not" and "half". The general word list is likewise obtained through accumulated experience or manual review. It can be understood that the general word list further ensures the recognition effect for synonyms.

Step 208: The computing server performs word segmentation on the Chinese words that cannot be found. Here, the Chinese words that cannot be found may be both of the Chinese words to be recognized, either one of them, or a Chinese word obtained from an earlier round of word segmentation.

Step 209: The computing server determines whether the Chinese words obtained by word segmentation all exist in the knowledge base. If so, the process returns to step 205; otherwise, step 208 is performed again.
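Steps 204 to 207 can be illustrated with the following minimal sketch. It assumes the pair has already passed the edit-distance check, models the knowledge base as an in-memory dictionary mapping each word to the weights of its minimum-granularity types, and models the changeable part of the general word list as a set. All entries, type assignments and weights below are hypothetical placeholders except the "蘋果" (apple) weights, which repeat the figures given in the text; every function name is an assumption. Word segmentation (steps 208 and 209) is not covered: the sketch only signals that it would be needed.

```python
from collections import Counter
from typing import Dict, Optional, Set

# Hypothetical knowledge base: word -> {minimum-granularity type: weight}.
KNOWLEDGE_BASE: Dict[str, Dict[str, float]] = {
    "蘋果": {"product type - simple": 0.38, "plant": 0.54},  # weights from the text
    "數位照相機": {"product type - simple": 1.0},             # placeholder entry
    "數位相機": {"product type - simple": 1.0},               # placeholder entry
}

# Hypothetical "can be changed" entries of the general word list.
CHANGEABLE: Set[str] = {"機", "器", "照"}


def weights_from_counts(counts: Dict[str, int]) -> Dict[str, float]:
    """Turn per-type occurrence counts into weights, as in the 60/40 example."""
    total = sum(counts.values())
    return {type_name: n / total for type_name, n in counts.items()}


def top_type(word: str) -> str:
    """Step 205: the minimum-granularity type with the largest weight."""
    types = KNOWLEDGE_BASE[word]
    return max(types, key=types.get)


def changed_chars(w1: str, w2: str) -> Set[str]:
    """Characters present in one word but not matched in the other."""
    diff = (Counter(w1) - Counter(w2)) + (Counter(w2) - Counter(w1))
    return set(diff)


def is_synonym_pair(w1: str, w2: str) -> Optional[bool]:
    """Return True/False for steps 204-207, or None if segmentation is needed."""
    if w1 not in KNOWLEDGE_BASE or w2 not in KNOWLEDGE_BASE:
        return None                               # step 204 fails: go to step 208
    if top_type(w1) != top_type(w2):
        return False                              # step 206: types differ
    return changed_chars(w1, w2) <= CHANGEABLE    # step 207: general word list


if __name__ == "__main__":
    print(weights_from_counts({"product type - simple": 60, "plant": 40}))
    print(is_synonym_pair("數位照相機", "數位相機"))  # True with the placeholder data
    print(is_synonym_pair("蘋果", "數位相機"))        # False: "plant" vs. product type
```

Returning None instead of False when a word is missing keeps "not found in the knowledge base" distinct from "found but not synonymous", since the former leads to word segmentation rather than to a non-synonym verdict.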
The above method for recognizing Chinese synonyms can be applied in a search engine, or in any other server or device that needs it.

It should be noted that the embodiment may further include storing the recognized synonyms in a thesaurus for later use. The thesaurus can store the recognized synonyms in the form of a data table. One possible implementation of the synonym data table is to store the two terms of each synonym pair together, which makes lookup convenient. This document does not limit the specific way synonyms are stored, as long as the synonyms of a term to be queried can be found in time.

With the method for recognizing Chinese synonyms provided by the embodiment shown in FIG. 2, the minimum edit distance between the Chinese words to be recognized is checked first, so the two words of a synonym pair differ little in wording and do not strike the user as jarring in a search application. Furthermore, the embodiment uses the knowledge base to verify the semantics of the Chinese words to be recognized; that is, by using the minimum-granularity type, the semantic attributes of the Chinese words are constrained more strictly, which greatly improves the accuracy of the recognized synonyms. In addition, because the differing characters in the Chinese word pair are verified again against the general word list, the recognition effect for synonyms is further ensured.

The resulting thesaurus can be applied as follows. After the search engine receives the term to be queried entered by the user, it looks up the synonym of that term in the thesaurus. The search engine then runs a search query with the user-entered term and with its synonym respectively, and returns all results of both queries to the user. This avoids jarring the user in the search application, enriches the search results, and makes the returned results better match the user's needs, so the thesaurus can be applied to query rewriting in search engines. For example, if a user queries "數位照相機" (digital camera) in a search engine, the search engine learns from the thesaurus that "數位相機" is a synonym of "數位照相機", runs a search with each of the two terms, and returns the search results containing "數位照相機" and those containing "數位相機". This not only enriches the search results but also avoids missing information the user needs.

Based on the above application, this application also provides a search method. Referring to FIG. 6, the method specifically includes:

Step 601: A user enters a term to be queried and submits a query request to the search engine.

Step 602: After receiving the user's query request containing the term to be queried, the search engine queries the pre-built thesaurus according to that term to find its synonym.

Step 603: The search engine searches with both the term to be queried and its synonym.

Step 604: The search engine returns to the user the search results that include the term to be queried and its synonym.

The embodiment of the present application further provides a device for recognizing Chinese synonyms. Referring to FIG. 4, the device specifically includes: an obtaining unit 401, a first determining unit 402, a second determining unit 403, a query unit 404, and a third determining unit 405. The obtaining unit 401 is configured to obtain any two Chinese words to be recognized.
In a search engine application, the search query log stores the keywords frequently queried by users. Thus, of the two Chinese words to be recognized, one comes from the keyword entered by the user in the search engine, and the other is obtained from the search query log according to that keyword.

The first determining unit 402 is configured to notify the second determining unit 403 after determining that the minimum edit distance between the two Chinese words is less than or equal to the edit distance threshold.

The second determining unit 403 is configured to notify the query unit 404 when it determines that the two Chinese words to be recognized both exist in the preset knowledge base.

The query unit 404 is configured to look up, in the knowledge base, the highest-weight minimum-granularity type of each Chinese word.

The third determining unit 405 is configured to determine that the two Chinese words are synonyms when the highest-weight minimum-granularity types found for the two Chinese words are equal, and to determine that the two Chinese words are non-synonyms when those types are not equal.

The device may further include a word segmentation unit 406, configured to perform word segmentation on Chinese words that cannot be found in the knowledge base and then notify the second determining unit 403. The second determining unit 403 is further configured to notify the query unit 404 when it determines that the Chinese words obtained by word segmentation all exist in the knowledge base, and to notify the word segmentation unit 406 when it determines that they do not all exist in the knowledge base.
The device may further include a general word list query unit 407, configured to notify the third determining unit 405 to determine that the two Chinese words are synonyms when the character or word that differs between the two Chinese words belongs to the changeable words in the preset general word list, and to notify the third determining unit 405 to determine that the two Chinese words are non-synonyms when the differing character or word does not belong to the changeable words in the preset general word list.

The above knowledge base includes terms and phrases. Each term or phrase corresponds to at least one type, and each type corresponding to a term or phrase has a weight.

The above knowledge base and general word list are obtained through accumulated experience.

The above device for recognizing Chinese synonyms can exist on its own as a computing server, as part of a search engine, or as part of another server.

With the device for recognizing Chinese synonyms provided by the embodiment of the present application, the minimum edit distance between the Chinese words to be recognized is checked first, so the two words of a synonym pair differ little in wording and do not strike the user as jarring in a search application. Furthermore, the embodiment uses the knowledge base to verify the semantics of the Chinese words to be recognized; that is, by using the minimum-granularity type, the semantic attributes of the Chinese words are constrained more strictly, which greatly improves the accuracy of the recognized synonyms. In addition, because the differing characters in the Chinese word pair are verified again against the general word list, the recognition effect for synonyms is further ensured.
The embodiment of the present application further provides a system for recognizing Chinese synonyms in a search engine. Referring to FIG. 5, the system specifically includes a device 501 for recognizing Chinese synonyms and a knowledge base storage device 502.

The knowledge base storage device 502 is configured to store each word itself, the type of the word, and the weight of the type.

The device 501 for recognizing Chinese synonyms is configured to: obtain any two Chinese words to be recognized; after determining that the minimum edit distance between the two Chinese words is less than or equal to the edit distance threshold, and if the two Chinese words to be recognized both exist in the preset knowledge base storage device 502, look up in the knowledge base storage device 502 the highest-weight minimum-granularity type of each Chinese word; determine that the two Chinese words are synonyms when the highest-weight minimum-granularity types found are equal; and determine that the two Chinese words are non-synonyms when those types are not equal.

The device 501 for recognizing Chinese synonyms is further configured to perform word segmentation on a Chinese word that cannot be found, and then determine whether the Chinese words obtained by word segmentation all exist in the knowledge base storage device 502. If so, it looks up in the knowledge base storage device 502 the highest-weight minimum-granularity type of each Chinese word and continues with the subsequent steps; otherwise, it performs word segmentation again.

The above system further includes a general word list storage device 503, configured to store the words that can be changed and the words that cannot be changed. The device 501 for recognizing Chinese synonyms is further configured to determine that the two Chinese words are synonyms when the character or word that differs between them belongs to the changeable words in the preset general word list, and to determine that the two Chinese words are non-synonyms when the differing character or word does not belong to the changeable words in the preset general word list.

The above system for recognizing Chinese synonyms can exist on its own as a server, as part of a search engine, or as part of another server.

With the system for recognizing Chinese synonyms provided by the embodiment of the present application, the minimum edit distance between the Chinese words to be recognized is checked first, so the two words of a synonym pair differ little in wording and do not strike the user as jarring in a search application. Furthermore, the embodiment uses the knowledge base to verify the semantics of the Chinese words to be recognized; that is, by using the minimum-granularity type, the semantic attributes of the Chinese words are constrained more strictly, which greatly improves the accuracy of the recognized synonyms. In addition, because the differing characters in the Chinese word pair are verified again against the general word list, the recognition effect for synonyms is further ensured.

The present application further provides a search device. As shown in FIG. 7, the search device comprises a receiving unit 701, a synonym query unit 702, a search unit 703 and a feedback unit 704, wherein:

the receiving unit 701 is configured to receive a query request from a user, the query request including a term to be queried;

the synonym query unit 702 is configured to query a pre-built thesaurus according to the term to be queried to find the synonym of the term to be queried;

the search unit 703 is configured to search with the term to be queried and the synonym of the term to be queried; and

the feedback unit 704 is configured to return the search results to the user.

The search device provided in this embodiment avoids jarring the user during search and enriches the search results, so that the returned results better match the user's needs.

It should be noted that this application uses Chinese only as an example to describe how synonyms are recognized; the application is not limited to the recognition of Chinese synonyms. For Japanese, Korean and other scripts, the method described in this application can also be used to recognize synonyms, which can be achieved by slight modification, equivalent replacement, improvement and the like on the basis of the method described herein.

In addition, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Furthermore, the terms "comprise", "include" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device.
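The search method of FIG. 6 and the four units of the search device of FIG. 7 described above can be sketched as follows. This is a minimal sketch under stated assumptions: the thesaurus is modeled as a dictionary built from synonym pairs, the search back end is passed in as a plain function, and the page texts and all function names are hypothetical toy data introduced only for the example. The text says that all results of both queries are returned; removing duplicates while merging is a design choice of this sketch, not something the text specifies.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


def build_thesaurus(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Store each recognized synonym pair in both directions for easy lookup."""
    thesaurus: Dict[str, Set[str]] = {}
    for a, b in pairs:
        thesaurus.setdefault(a, set()).add(b)
        thesaurus.setdefault(b, set()).add(a)
    return thesaurus


def search_with_synonyms(term: str,
                         thesaurus: Dict[str, Set[str]],
                         search: Callable[[str], List[str]]) -> List[str]:
    """Steps 602-604: look up synonyms, search with each term, merge results."""
    # Synonym lookup (role of the synonym query unit 702).
    queries = [term] + sorted(thesaurus.get(term, set()))
    results: List[str] = []
    seen: Set[str] = set()
    # One search per query term (role of the search unit 703).
    for query in queries:
        for hit in search(query):
            if hit not in seen:      # keep first occurrence, drop duplicates
                seen.add(hit)
                results.append(hit)
    return results                   # returned to the user (feedback unit 704)


# Toy page collection and keyword search, for illustration only.
PAGES = ["數位照相機 評測", "數位相機 特價", "馬鈴薯 食譜"]


def toy_search(query: str) -> List[str]:
    return [page for page in PAGES if query in page]


if __name__ == "__main__":
    thesaurus = build_thesaurus([("數位照相機", "數位相機")])
    print(toy_search("數位照相機"))                                 # one page only
    print(search_with_synonyms("數位照相機", thesaurus, toy_search))  # two pages
```

In the toy data, a plain keyword search for "數位照相機" misses the page that only mentions "數位相機", while the synonym-aware search returns both, which is the enrichment effect described for query rewriting.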
For convenience of description, the above devices and/or parts of the system are described separately as units divided by function. Of course, when implementing the present application, the functions of the units can be implemented in one or more pieces of software and/or hardware.

A person skilled in the art can understand that all or part of the steps of the foregoing method embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium; the storage media referred to here include ROM/RAM, magnetic disks, optical discs and the like.

The above description is only the preferred embodiment of the present application and is not intended to limit the scope of the present application. Any modifications, equivalent substitutions, improvements and the like made within the spirit and principles of this application are included in the scope of the present application.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present application; those skilled in the art can obtain other drawings from these drawings without any creative labor.

FIG. 1 is a flow chart for recognizing Chinese synonyms according to an embodiment of the present application;
FIG. 2 is a flow chart according to a preferred embodiment of the present application;
FIG. 3 is a schematic diagram of a knowledge base type hierarchy according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a device for recognizing Chinese synonyms according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a system for recognizing Chinese synonyms according to an embodiment of the present application;
FIG. 6 is a flow chart of a search method according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a search device according to an embodiment of the present application.

[Main component symbol description]

401: obtaining unit
402: first determining unit
403: second determining unit
404: query unit
405: third determining unit
406: word segmentation unit
407: general word list query unit
501: device for recognizing Chinese synonyms
502: knowledge base storage device
503: general word list storage device
701: receiving unit
702: synonym query unit
703: search unit
704: feedback unit

Claims

VII. Scope of patent application:

1. A method for recognizing Chinese synonyms, characterized by comprising:
a. a computing server obtains any two Chinese words to be recognized;
b. after determining that the minimum edit distance between the two Chinese words is less than or equal to an edit distance threshold, step c is executed;
c. it is determined whether the two Chinese words to be recognized both exist in a preset knowledge base, and if so, the highest-weight minimum-granularity type of each Chinese word is looked up in the knowledge base; and
d. if the highest-weight minimum-granularity types found for the two Chinese words are equal, the two Chinese words are determined to be synonyms; otherwise, the two Chinese words are determined to be non-synonyms.

2. The method according to claim 1, wherein, if the two Chinese words to be recognized do not both exist in the preset knowledge base, the method further comprises:
e. the computing server performs word segmentation on the Chinese words that cannot be found, and then determines whether the Chinese words obtained by word segmentation all exist in the knowledge base; if so, the highest-weight minimum-granularity type of each Chinese word is looked up in the knowledge base, and the subsequent steps are continued.

3. The method according to claim 1 or 2, wherein, when the highest-weight minimum-granularity types of the two Chinese words are determined to be equal, the method further comprises: the computing server determines whether the character or word that differs between the two Chinese words belongs to the changeable words in a preset general word list; if so, the two Chinese words to be recognized are determined to be synonyms; otherwise, the two Chinese words are determined to be non-synonyms.

4. The method according to claim 1, wherein the knowledge base includes terms and phrases, each term or phrase corresponds to at least one type, and each type corresponding to a term or phrase has a weight.

5. The method according to claim 4, wherein looking up in the knowledge base the highest-weight minimum-granularity type of each Chinese word comprises: finding in the knowledge base the term or phrase corresponding to each Chinese word, and finding the highest-weight minimum-granularity type of each Chinese word according to the at least one type of each term or phrase and the weight of each type of the term or phrase.

6. The method according to claim 1, wherein, if the two Chinese words are determined to be synonyms, the recognized synonyms are stored in a thesaurus.

7. A method for performing a search, which utilizes the method of claim 6, characterized in that:
a search engine receives a query request from a user, the query request including a term to be queried;
the search engine queries the preset thesaurus according to the term to be queried to find the synonym of the term to be queried; and
the search engine searches with the term to be queried and the synonym of the term to be queried, and returns to the user the search results that include the term to be queried and the synonym of the term to be queried.

8. A device for recognizing Chinese synonyms, characterized by comprising:
an obtaining unit, configured to obtain any two Chinese words to be recognized;
a first determining unit, configured to notify a second determining unit after determining that the minimum edit distance between the two Chinese words is less than or equal to an edit distance threshold;
the second determining unit, configured to notify a query unit when determining that the two Chinese words to be recognized both exist in a preset knowledge base;
the query unit, configured to look up in the knowledge base the highest-weight minimum-granularity type of each Chinese word; and
a third determining unit, configured to determine that the two Chinese words are synonyms when the highest-weight minimum-granularity types found for the two Chinese words are equal, and to determine that the two Chinese words are non-synonyms when those types are not equal.

9. The device according to claim 8, wherein the device further comprises: a word segmentation unit, configured to perform word segmentation on Chinese words that cannot be found in the knowledge base and then notify the second determining unit; the second determining unit is further configured to notify the query unit when determining that the Chinese words obtained by word segmentation all exist in the knowledge base, and to notify the word segmentation unit when determining that they do not all exist in the knowledge base.

10. The device according to claim 8 or 9, wherein the device further comprises: a general word list query unit, configured to notify the third determining unit to determine that the two Chinese words are synonyms when the character or word that differs between the two Chinese words belongs to the changeable words in a preset general word list, and to notify the third determining unit to determine that the two Chinese words are non-synonyms when the differing character or word does not belong to the changeable words in the preset general word list.

11. The device according to claim 8, wherein the knowledge base includes terms and phrases, each term or phrase corresponds to at least one type, and each type corresponding to a term or phrase has a weight.

12. The device according to claim 8, wherein the device for recognizing Chinese synonyms is a computing server or a search engine.

13. A search device for performing a search, which utilizes the method of claim 7, characterized by comprising:
a receiving unit, configured to receive a query request from a user, the query request including a term to be queried;
a synonym query unit, configured to query a pre-built thesaurus according to the term to be queried to find the synonym of the term to be queried;
a search unit, configured to search with the term to be queried and the synonym of the term to be queried; and
a feedback unit, configured to return the search results to the user.