TWI681304B - System and method for adaptively adjusting related search words - Google Patents


Info

Publication number
TWI681304B
Authority
TW
Taiwan
Prior art keywords
search
word
text
related word
threshold
Prior art date
Application number
TW107145181A
Other languages
Chinese (zh)
Other versions
TW202022635A (en)
Inventor
沈民新
Original Assignee
財團法人工業技術研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 財團法人工業技術研究院
Priority to TW107145181A (TWI681304B)
Priority to CN201910088844.9A (CN111324705B)
Application granted
Publication of TWI681304B
Publication of TW202022635A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A system for adaptively adjusting related search words includes an input device, a search log collection module, a threshold setting module, and a process evolution module. The input device is configured to receive a search word. The search log collection module is configured to determine whether the cumulative number of searches for the search word in the historical search logs is greater than a first threshold or less than a second threshold. The threshold setting module is configured to set the first and second thresholds in terms of the number of search records in the search logs. When the cumulative number of searches is between the first threshold and the second threshold, the process evolution module optimizes a mid-term search process to find at least one related word in the indexed text and/or at least one historical search word in the historical search logs that is most relevant to the attributes or content of the search word.

Description

System and method for adaptively adjusting related search words

The present invention relates to a system and method for adaptively adjusting related search words.

Modern search systems usually return, along with the search results, other search words related to the user's search word, to help the user quickly clarify the query target. There are several reasons for this: the keywords a user enters often cannot describe the search intent precisely in a few short words; the search word or search target may have multiple descriptions or be ambiguous, so that the user's vocabulary does not match the vocabulary of the text; the user may choose a wrong search word because of insufficient understanding or knowledge of the target; or the user may make typing errors such as homophones or near-homophones. In general, techniques for extracting related search words can be divided by data source into methods based on indexed text content and methods based on historical query records. A text-based method can provide a list of suggested related search words as soon as the search system goes online, based on correlation analysis among the words in the indexed text. Its drawback is that it can only make suggestions from fixed text content and cannot analyze the historical query records accumulated later to predict the user's search intent. A method based on historical query records can provide up-to-date predictions of search intent from continuously accumulated user data, and therefore a better list of suggested related search words, but it cannot make suggestions in the early stage of the system: a long period of user activity is needed to accumulate enough data to analyze. Conventional approaches also combine the two methods through weight integration, so that related search words can be recommended both in the early stage, when there is no user history, and in the later stage, when sufficient history has accumulated.

However, weight integration has its own data-source problem in choosing the weight combination. Manually set weights often fail to achieve the best results, and enough search records usually have to be accumulated before a first set of optimal weights can be trained with a statistical model or machine learning. Transfer learning across different vertical domains also remains difficult. The extraction techniques above are therefore each suited to search systems at a different stage after launch. Because the number of search records varies, they cannot provide related search words suitable for suggesting to users at all times, and an improvement is needed.

The present invention relates to a system and method for adaptively adjusting related search words, which can self-adjust the related search words according to the number of search records accumulated by the system, so as to provide related search words suitable for suggesting to users.

According to one aspect of the present invention, a system for adaptively adjusting related search words is provided, including an input device, a record collection module, a threshold setting module, and an evolution module. The input device receives user input and produces a search word. The record collection module determines whether the cumulative search count of the search word is greater than a first threshold or less than a second threshold. The threshold setting module sets the number of search records that satisfies the first or second threshold. The evolution module adjusts a search flow according to the number of search records. When the cumulative search count of the search word is greater than the first threshold, the evolution module finds, from a historical search record, at least one historical search word related to the content or attributes of the search word. When the cumulative search count is less than the second threshold, the evolution module performs an initial search flow to find, in a text, at least one related word related to the content or attributes of the search word. When the cumulative search count is between the first threshold and the second threshold, the evolution module optimizes a mid-term search flow to further find, in the text and in the historical search record, at least one related word and/or at least one historical search word whose relevance to the content or attributes of the search word is maximized.

According to another aspect of the present invention, a method for adaptively adjusting related search words is provided, including the following steps. An input flow receives user input and produces a search word. A record collection flow determines whether the cumulative search count of the search word is greater than a first threshold or less than a second threshold. A threshold setting flow sets the number of search records that satisfies the first or second threshold. An evolution flow adjusts a search flow according to the number of search records. When the cumulative search count of the search word is greater than the first threshold, the evolution flow finds, from a historical search record, at least one historical search word related to the content or attributes of the search word. When the cumulative search count is less than the second threshold, the evolution flow performs an initial search flow to find, in a text, at least one related word related to the content or attributes of the search word. When the cumulative search count is between the first threshold and the second threshold, the evolution flow optimizes a mid-term search flow to further find, in the text and in the historical search record, at least one related word and/or at least one historical search word whose relevance to the content or attributes of the search word is maximized.

For a better understanding of the above and other aspects of the present invention, embodiments are described in detail below with reference to the accompanying drawings:

Embodiments are described in detail below. They serve only as examples and are not intended to limit the scope of protection of the present invention. The same or similar reference symbols denote the same or similar elements. Directional terms used in the following embodiments, such as up, down, left, right, front, or back, refer only to the directions in the accompanying drawings. They are therefore used for illustration and not to limit the invention.

According to an embodiment of the present invention, a system for adaptively adjusting related search words is provided, for example a search engine with a self-adjusting search flow. For a search engine that has just adopted the system and has not yet accumulated a sufficient number of search records, the system can initially use the indexed text and the index vocabulary to find at least one related word in the text that is related to the textual content or feature attributes of the search word, and so build an initial related search word list. In the mid-term stage, once a certain number of search records has accumulated, the system can use those historical search records together with the text indexed in the initial stage to find at least one related word and/or at least one historical search word, in the text and in the historical search records, whose relevance to the content or attributes of the search word is maximized, and so build a mid-term related search word list. In the late stage, once a sufficient number of search records has accumulated, the system can find at least one historical search word related to the content or attributes of the search word directly from the search word entered by the user, and so build a late-stage related search word list.
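
The three stages described above can be sketched as a simple dispatch on the cumulative search count. This is a minimal illustration rather than the patented implementation: the names `Stage` and `select_stage` and the threshold values used in the usage note are hypothetical, and treating a count exactly equal to a threshold as mid-term is an assumption, since the description does not say how ties are handled.

```python
from enum import Enum


class Stage(Enum):
    INITIAL = "initial"  # related words come from the indexed text only
    MID = "mid"          # text and historical search records are combined
    LATE = "late"        # related words come from historical search records only


def select_stage(search_count: int, first_threshold: int, second_threshold: int) -> Stage:
    """Pick a search flow from the cumulative search count of a search word.

    Assumes first_threshold > second_threshold. Counts equal to either
    threshold fall into the mid-term stage (an assumption of this sketch).
    """
    if first_threshold <= second_threshold:
        raise ValueError("first_threshold must be greater than second_threshold")
    if search_count > first_threshold:
        return Stage.LATE
    if search_count < second_threshold:
        return Stage.INITIAL
    return Stage.MID
```

For example, with hypothetical thresholds of 100 and 30, `select_stage(5, 100, 30)` selects the initial flow and `select_stage(500, 100, 30)` selects the late flow.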

As described above, the system can optimize itself according to the number of search records accumulated in each period, so that its evolution module progresses smoothly from the early stage, in which there are no user behavior records (search records), to the late stage, which relies mainly on user behavior records (search records), and thereby provides related search words suitable for suggesting to users.

Referring to FIG. 1, according to an embodiment of the present invention, a system 100 for adaptively adjusting related search words includes an input device 110, a record collection module 120, a threshold setting module 130, and an evolution module 140. The input device 110 receives user input and produces a search word 112. The record collection module 120 determines whether the cumulative search count of the search word 112 is greater than a first threshold or less than a second threshold (represented by the threshold 132). The threshold setting module 130 sets the number of search records that satisfies the first or second threshold. The evolution module 140 adjusts a search flow according to the number of search records.

In one embodiment, the input device 110 may be a user interface for reading data entered by the user, including text, symbols, and/or speech. Taking a computer or remote server as an example, the input device 110 may be a handheld electronic device connected to the computer or remote server, although the invention is not limited thereto. The input device 110 passes the search word 112 that the user wants to look up to the computer or remote server, and the search engine 102 incorporating the system 100 then searches an online or local text database. The databases may include a record database 124 and a text database 126. The text database 126 stores the sources of the text 114 to be searched, including text files and/or database fields. Text files include, for example, product manuals, advertising copy, product test reports, and web pages. Database fields include, for example, the data fields of a product database, such as product name, keywords, product description, and brand. The record database 124 stores the users' historical search records 122.

The record collection module 120 collects the users' operations on the system 100, including the search words entered, click positions, click counts, browsing time, and similar information, as well as the content or attributes of each search word 112. Once collected, this data becomes the historical search records 122, which the record collection module 120 stores in the record database 124. The content or attributes of a search word 112 may be a product's Chinese name, English name, abbreviation, brand, model, function, the names of other brands, and so on, although the invention is not limited thereto. The content or attributes of a search word 112 may be determined from dictionary word senses, user-defined semantics, manually edited open data (such as Wikipedia, DBpedia, or the Open Directory Project), or statistical named entity recognition (NER). After the content or attributes of the search word 112 are determined, the system 100 looks for related words 148 according to them.

In addition, the system 100 can use the search engine 102 to parse the search word 112 and reconstruct its syntax, and to filter out words in the text 114 and/or the historical search records 122 that are unrelated to the content or attributes of the search word 112, so as to ensure that data retrieval is both accurate and thorough.

The threshold setting module 130 sets the number of search records that satisfies the first or second threshold. The number of search records is not limited to the count accumulated for search words 112 with identical wording; it may also be the count accumulated for search words 112 of the same type that differ in wording but are close in meaning. When different users search for search words 112 of the same type or for similar search words 112, the system 100 can accumulate or weight the search records of those search words. When the number of search records accumulated by the system 100 reaches a threshold 132, the evolution module 140 of the system 100 adaptively adjusts the search flow according to the number of search records, as shown in FIGS. 2, 3, and 4.

Referring to FIG. 1, the system 100 may further include a word segmentation module 146, a record related-word generation module 160, and a text related-word generation module 150. The index vocabulary 144 contains a list of strings, each consisting of one or more alphanumeric characters or symbols. The index vocabulary may be preset manually, taken from a general-purpose or domain-specific dictionary, built by having the word segmentation module 146 analyze the text 114 and collecting all resulting strings and phrases, or formed by combining these approaches, for example by merging a domain dictionary with all words obtained from the text by the word segmentation module 146. The content of the text 114 may be documents, web pages, or designated tables or fields of a database. For example, if the search system targets products, the text 114 may consist of database fields such as product name, product description, and product keywords from the product table of a product database, together with the content of product description web pages.

The word segmentation module 146 divides the search word 112 entered by the user (for example, Chinese text) into meaningful words. For example, if the user enters the search word 晶片讀卡機 (chip card reader), the word segmentation module 146 may split it into 晶片 (chip) and 讀卡機 (card reader), or keep only 讀卡機 (card reader). Accordingly, when the search word 112 does not appear in the text 114, the word segmentation module 146 uses the index vocabulary 144 to break the search word 112 into at least one index word by character parsing, word parsing, or word matching, so that the search engine 102 can go on to search for those index words in the text 114. Chinese text can be segmented with a dictionary-based segmentation algorithm, such as forward maximum matching, reverse maximum matching, or bidirectional maximum matching, or with a corpus-based statistical segmentation algorithm such as conditional random fields (CRF) or deep neural networks (DNN), although the invention is not limited thereto.
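
Forward maximum matching, one of the dictionary-based algorithms named above, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `forward_max_match` and the tiny vocabulary in the usage note are hypothetical, and falling back to single characters when nothing matches is a common convention that the description does not prescribe.

```python
from __future__ import annotations


def forward_max_match(text: str, vocabulary: set[str]) -> list[str]:
    """Greedy left-to-right segmentation against a vocabulary.

    At each position the longest vocabulary entry that matches is taken.
    If no entry matches, a single character is emitted so that the scan
    always advances.
    """
    max_len = max((len(word) for word in vocabulary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens
```

With the hypothetical vocabulary `{"晶片", "讀卡機", "讀卡"}`, segmenting `"晶片讀卡機"` yields `["晶片", "讀卡機"]`, matching the example in the paragraph above.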

The text related-word generation module 150 uses the index vocabulary 144 to find the top M index words in the text 114 that are most related to the search word 112, and produces a text related-word list 152. M is, for example, a positive integer of 5 or more. In one embodiment, the text related-word generation module 150 computes a relatedness strength from the probabilities that the search word 112 and an index word appear individually or together in the text 114. A higher strength indicates a closer relation, and a lower strength a weaker one. The relatedness strength can be computed by association rule learning, pointwise mutual information (PMI), improved PMI variants, KL divergence (Kullback-Leibler divergence), Normalized Google Distance, or WordNet-distance-based algorithms, although the invention is not limited thereto.
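
As one concrete instance of the relatedness computation, the sketch below scores word pairs by document-level PMI, log2(P(a, b) / (P(a) P(b))), and returns the top M words for a query. It is an assumption-laden illustration: the function names `pmi_table` and `top_related` are hypothetical, documents are assumed to be already segmented into words, and probabilities are estimated as simple document frequencies without smoothing.

```python
from __future__ import annotations

import math
from collections import Counter
from itertools import combinations


def pmi_table(documents: list[list[str]]) -> dict[tuple[str, str], float]:
    """Document-level PMI for every word pair that co-occurs at least once."""
    n_docs = len(documents)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words of a pair
    for doc in documents:
        words = sorted(set(doc))  # sorted so each pair has one canonical order
        word_df.update(words)
        pair_df.update(combinations(words, 2))
    return {
        (a, b): math.log2((count / n_docs) / ((word_df[a] / n_docs) * (word_df[b] / n_docs)))
        for (a, b), count in pair_df.items()
    }


def top_related(query: str, table: dict[tuple[str, str], float], m: int = 5) -> list[str]:
    """Return up to m words with the highest PMI against the query word."""
    scored = []
    for (a, b), score in table.items():
        if query == a:
            scored.append((score, b))
        elif query == b:
            scored.append((score, a))
    scored.sort(key=lambda item: (-item[0], item[1]))  # ties broken alphabetically
    return [word for _, word in scored[:m]]
```

The same scoring could in principle be applied to historical search records by treating each search session as a "document", though that mapping is an assumption and not something the description specifies.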

The record related-word generation module 160 analyzes the degree of relation between any two historical search words in the historical search records 122 and finds the top N historical search words most related to the search word 112, producing a record related-word list 162. N is, for example, a positive integer of 5 or more. In one embodiment, the record related-word generation module 160 computes a relatedness strength from the probabilities that the content or attributes of the current search word 112 and of a historical search word appear individually or together in the historical search records 122. A higher strength indicates a closer relation, and a lower strength a weaker one. Besides comparing where the words occur, the degree of relation can also be computed from other attributes of the search word in the historical search records 122, such as click position, click count, and browsing time. The relatedness strength is computed, for example, by pointwise mutual information (PMI), but other algorithms can also be used, such as association rule learning, improved PMI variants, KL divergence (Kullback-Leibler divergence), Normalized Google Distance, or WordNet-distance-based algorithms, although the invention is not limited thereto.

Referring to FIG. 1, to optimize the mid-term search flow, the system 100 further includes a related-word discrimination calculation module 170 and a related-word recommendation module 174. The related-word discrimination calculation module 170 computes a discrimination value 172 for each related word 148 from the text 114, the index vocabulary 144, the record related-word list 162, and the text related-word list 152. The discrimination value 172 measures how distinctive a related word 148 is, that is, how much it differentiates among the texts 114. It can also be used to make the related-word list more diverse and avoid recommending related words that are too similar to one another. A related word 148 that appears in only one text 114 has a higher discrimination value, and a related word 148 that appears in many texts 114 has a lower one. For example, across multiple texts 114, the distinctiveness of a related word 148 is inversely proportional to the number of texts 114 in which it appears (document frequency, DF); this is the inverse document frequency (IDF). The related-word discrimination calculation module 170 may therefore use, for example, inverse document frequency, residual inverse document frequency (RIDF), or discrimination power to compute the discrimination value 172 of each related word 148 and build a table matching related words 148 to discrimination values 172, although the invention is not limited thereto.
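
Inverse document frequency, the first of the measures named above, can be sketched in a few lines. The function name `idf_scores` is hypothetical, the unsmoothed natural-log form log(N / DF) is one common variant chosen here for illustration, and the input is assumed to be texts already segmented into words.

```python
from __future__ import annotations

import math


def idf_scores(documents: list[list[str]]) -> dict[str, float]:
    """Unsmoothed IDF, log(N / DF), for every word seen in the documents."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc):  # count each word at most once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / count) for word, count in doc_freq.items()}
```

A word that occurs in every document scores 0, and a word confined to a single document scores log(N), which reproduces the behavior described above: the fewer texts a related word appears in, the higher its discrimination value.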

In one embodiment, when a related word 148 is present in the index vocabulary 144, the related-word discrimination calculation module 170 computes the discrimination value of that index word directly. When a related word 148 is not present in the index vocabulary 144, the word segmentation module 146 first segments the related word 148, the related-word discrimination calculation module 170 then computes a discrimination value for each resulting index word, and the discrimination value of the related word 148 is estimated from those values by taking their minimum, maximum, arithmetic mean, or weighted mean.
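
The fallback described above can be sketched as follows. The names `estimate_discrimination` and `AGGREGATORS` are hypothetical, the segmenter is passed in as a plain function, and the weighted mean is omitted because the description does not say how the weights would be chosen.

```python
from __future__ import annotations

from statistics import mean
from typing import Callable

# Ways to combine the per-part values; the weighted mean is left out of this sketch.
AGGREGATORS = {"min": min, "max": max, "mean": mean}


def estimate_discrimination(
    related_word: str,
    index_scores: dict[str, float],
    segment: Callable[[str], list[str]],
    how: str = "min",
) -> float:
    """Look up a related word's discrimination value, or estimate it from its parts."""
    if related_word in index_scores:
        return index_scores[related_word]
    # Not an index word: segment it and combine the values of the known parts.
    values = [index_scores[part] for part in segment(related_word) if part in index_scores]
    if not values:
        raise ValueError(f"no index words found in {related_word!r}")
    return AGGREGATORS[how](values)
```

Taking the minimum is the conservative choice, since one common part then pulls the estimate down, while the maximum lets a single distinctive part dominate.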

In one embodiment, the system 100 further includes a new word recognition module 142, which extracts from a given word any new words not contained in the index vocabulary. The new word recognition module 142 may work through linguistic rules such as phonological, grammatical, or word-formation rules; through statistical models such as the hidden Markov model (HMM), conditional random fields (CRF), support vector machines (SVM), or deep neural networks (DNN); or through specific statistics such as pointwise mutual information (PMI). When a related word 148 is not present in the index vocabulary 144, the new word recognition module 142 extracts from the related word 148 the substring recognized as a new word and assigns it an estimated discrimination value. The discrimination value of a new word may be a preset fixed value, or may be derived dynamically from the maximum discrimination value among all words in the index vocabulary 144 or from a weighted value of that maximum. The remaining, non-new-word part of the related word continues to be evaluated against the index vocabulary 144. If that part is present in the index vocabulary 144, the related-word discrimination calculation module 170 computes the discrimination value of the index word directly; the discrimination values of the new-word part and the non-new-word part are then combined by taking their minimum, maximum, arithmetic mean, or weighted mean to estimate the discrimination value of the related word 148. If the non-new-word part is not present in the index vocabulary 144, the word segmentation module 146 segments it into at least one index word, the related-word discrimination calculation module 170 computes a discrimination value for each resulting index word, and the discrimination values of the new-word part and the non-new-word parts are again combined by taking their minimum, maximum, arithmetic mean, or weighted mean to estimate the discrimination value of the related word 148.

The related-word recommendation module 174 compares the discrimination values of the related words 148 in the record related-word list 162 with those of the related words 148 in the text related-word list 152, ranks the related words 148 by discrimination value, and selects from the two lists the top P related words 148 with the highest discrimination values. P is, for example, a positive integer of 5 or more. This completes the related search word list 176 suitable for suggestion.
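
The selection step above amounts to merging the two candidate lists and keeping the P entries with the highest discrimination values. The sketch below is illustrative only: the function name `recommend` is hypothetical, and both the removal of duplicates and the default value of 0.0 for a word without a score are assumptions the description does not address.

```python
from __future__ import annotations


def recommend(
    text_related: list[str],
    record_related: list[str],
    discrimination: dict[str, float],
    p: int = 5,
) -> list[str]:
    """Merge both related-word lists and keep the p most discriminative words."""
    # dict.fromkeys removes duplicates while preserving first-seen order.
    candidates = dict.fromkeys([*record_related, *text_related])
    ranked = sorted(candidates, key=lambda word: discrimination.get(word, 0.0), reverse=True)
    return ranked[:p]
```

Because Python's sort is stable, words with equal discrimination values keep their merged order, so record-based words come before text-based ones in a tie.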

Referring to FIGS. 1 and 2, FIG. 2 is a schematic diagram of the initial search flow performed by the system 100 for adaptively adjusting related search words 176 according to an embodiment of the present invention, which includes steps S11 to S14. In steps S11 and S12, it is determined whether the search word 112 is in the search records and, if so, whether the cumulative search count of the search word 112 is less than the second threshold. When both conditions are met, the evolution module 140 performs an initial search flow. At this point, because the search word 112 is absent from the historical search records 122 or its cumulative search count is very low, the search engine 102 cannot find historical search words suitable for suggestion from the current search word 112. In steps S13 and S14, it is determined whether the search word 112 is in a text 114. If it is not, the word segmentation module 146 breaks the search word 112 into at least one index word according to the index vocabulary 144, and the flow returns to step S11 to determine whether the index word is in the search records. When the search word 112 is present in a text 114, the text related-word generation module 150 uses the built-in text 114 and the index vocabulary 144 to find at least one related word 148 in the text 114 that is related to the content or attributes of the search word 112.

Next, please refer to FIGS. 1 and 3. FIG. 3 is a schematic diagram of the system 100 for adaptively adjusting related search terms according to an embodiment of the present invention optimizing the mid-term search process. The steps of this embodiment are the same as those of the preceding embodiment, except that in step S12, when the cumulative number of searches for the search term 112 is greater than the second threshold and less than the first threshold, the system 100 has accumulated a certain number of search logs, which allows the evolution module 140 to execute a mid-term search process. At this point, the log related word generation module 160 can find, from a historical search log 122, at least one historical search term related to the content or attributes of the search term 112. The search engine 102 can therefore find related words 148 suitable for suggestion not only based on the current search term 112 but also based on the built-in text 114 and the index word list 144. The related word discrimination calculation module 170 and the new word recognition module 142 then produce discrimination values for the related words, and the related word recommendation module 174 selects among them to further find the at least one related word 148 and/or the at least one historical search term that is most relevant to the content or attributes of the search term 112, thereby obtaining an optimized related search word list 176.

Next, please refer to FIGS. 1 and 4. FIG. 4 is a schematic diagram of the late-stage search process performed by the system 100 for adaptively adjusting related search terms according to an embodiment of the present invention. This process omits the text search of steps S13 and S14 used in the initial stage and performs only the determinations of steps S11 and S12. In this embodiment, when the search term 112 appears in the historical search log 122 and the cumulative number of searches for the search term 112 is greater than both the first threshold and the second threshold, the system 100 has accumulated a sufficient number of historical search logs 122, which allows the evolution module 140 to execute a late-stage search process. At this point, the log related word generation module 160 can find, from a historical search log 122, at least one historical search term related to the content or attributes of the search term 112. The search engine 102 therefore does not need to find related words 148 suitable for suggestion from the built-in text 114 and the index word list 144; instead, it finds them directly from a historical search log 122 based on the current search term 112.

The first threshold and the second threshold are values of the cumulative number of searches for the search term 112. They can be determined according to the general statistical notion of a large sample (a sample size greater than 30), or according to search systems in the same field and of similar scale. For example, in the field of shopping search, the thresholds can be set with reference to cases with a similar number of products, using the cumulative number of searches that was sufficient to produce log related words that users found satisfactory. Alternatively, while the search system 100 is in use, domain experts can dynamically adjust the first and second thresholds according to the search results, either to control how quickly the initial stage evolves into the later stages or to let the mid-term or late stage regress to the preceding stage.
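The choice among the initial, mid-term, and late-stage search processes described for FIGS. 2 to 4 can be summarized in a short sketch. The names `Stage` and `select_stage` and the threshold values in the usage note are hypothetical. The description only covers counts strictly below the second threshold, strictly between the two thresholds, and strictly above the first threshold; treating a count exactly equal to either threshold as mid-term is an assumption of this sketch.

```python
from enum import Enum


class Stage(Enum):
    INITIAL = "initial"
    MID = "mid"
    LATE = "late"


def select_stage(count, first_threshold, second_threshold):
    """Map a cumulative search count to a search stage.

    first_threshold is the upper bound and second_threshold the lower bound,
    following the numbering used in the description.
    """
    if second_threshold > first_threshold:
        raise ValueError("second threshold must not exceed first threshold")
    if count > first_threshold:
        return Stage.LATE      # enough logs: use the historical search log only
    if count < second_threshold:
        return Stage.INITIAL   # too few logs: fall back to the indexed text
    # Assumption: boundary values are treated as mid-term.
    return Stage.MID           # combine text and log related words
```

For instance, with a hypothetical second threshold of 30 (in line with the large-sample rule of thumb mentioned above) and a hypothetical first threshold of 300, a term searched 100 times would be handled by the mid-term process. Raising or lowering either value, as a domain expert might, moves terms between stages without changing the code.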

In one embodiment, the above method for adaptively adjusting related search terms 176 can be implemented as a software program stored on a non-transitory computer readable medium, such as a hard disk, an optical disc, a flash drive, a memory, or another program storage device. When a processor loads the software program from the non-transitory computer readable medium, it can execute the method flows shown in FIGS. 2, 3, and 4, evolving an initial search process into a mid-term search process and then evolving the mid-term search process into a late-stage search process.

In one embodiment, the system 100 for adaptively adjusting related search terms may include a processor and a program storage device. The processor can execute one or more computer-executable instructions, and the program storage device stores computer program modules executable by the processor. When executed by the processor, the computer program modules cause the processor to perform the steps shown in FIGS. 2, 3, and 4.

In another embodiment, the record collection module 120, the threshold setting module 130, the evolution module 140, the new word recognition module 142, the text related word generation module 150, the log related word generation module 160, the related word discrimination calculation module 170, and the related word recommendation module 174 described above may each be implemented as a software unit or a hardware unit; alternatively, some of the modules may be combined and implemented in software while others are combined and implemented in hardware. A module implemented in software can be regarded as an operation process, namely a record collection process, a threshold setting process, an evolution process, a new word recognition process, a text related word generation process, a log related word generation process, a related word discrimination calculation process, a related word recommendation process, and so on, which can be loaded by the processor to perform the corresponding function. A module implemented in hardware can be implemented, for example, as a microcontroller, a microprocessor, a digital signal processor, an application specific integrated circuit (ASIC), a digital logic circuit, or a field programmable gate array (FPGA).

The system and method for adaptively adjusting related search terms disclosed in the above embodiments of the present invention can adjust the related search terms by themselves according to the number of search logs accumulated by the system, so as to provide related search terms suitable for suggestion to users. This reduces the manpower and time required for system program development, and avoids both the need to learn a first set of weight combinations in advance and the problem of transfer learning across vertical domains. In addition, the present invention takes into account that the search term recommendation process can continue to evolve as the search logs change, and establishes a search term recommendation mechanism with higher accuracy. This avoids the problem that a single, fixed search term recommendation process may produce related words unrelated to the content or attributes of the search term, makes management more convenient, and improves flexibility of use.

In summary, although the present invention has been disclosed above by way of embodiments, these embodiments are not intended to limit the present invention. Those with ordinary knowledge in the technical field to which the present invention belongs can make various changes and modifications without departing from the spirit and scope of the present invention. Therefore, the scope of protection of the present invention shall be as defined by the appended claims.

100‧‧‧system for adaptively adjusting related search terms
102‧‧‧search engine
110‧‧‧input device
112‧‧‧search term
114‧‧‧text
120‧‧‧record collection module
122‧‧‧historical search log
124‧‧‧log database
126‧‧‧text database
130‧‧‧threshold setting module
132‧‧‧threshold
140‧‧‧evolution module
142‧‧‧new word recognition module
144‧‧‧index word list
146‧‧‧word segmentation module
148‧‧‧related word
150‧‧‧text related word generation module
152‧‧‧text related word list
160‧‧‧log related word generation module
162‧‧‧log related word list
170‧‧‧related word discrimination calculation module
172‧‧‧discrimination value
174‧‧‧related word recommendation module
176‧‧‧related search word list

FIG. 1 is a schematic diagram of a system for adaptively adjusting related search terms according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of the initial search process performed by the system for adaptively adjusting related search terms according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of the system for adaptively adjusting related search terms according to an embodiment of the present invention optimizing the mid-term search process.
FIG. 4 is a schematic diagram of the late-stage search process performed by the system for adaptively adjusting related search terms according to an embodiment of the present invention.

Claims (18)

1. A system for adaptively adjusting related search terms, comprising:
an input device configured to receive a search term;
a record collection module configured to determine whether a cumulative number of searches for the search term is greater than a first threshold or less than a second threshold;
a threshold setting module configured to set the number of search logs that satisfies the first or the second threshold; and
an evolution module configured to adjust a search process according to the number of search logs, wherein when the cumulative number of searches for the search term is greater than the first threshold, the evolution module finds, from a historical search log, at least one historical search term related to the content or attributes of the search term,
wherein when the cumulative number of searches for the search term is less than the second threshold, the evolution module executes an initial search process to find at least one related word in a text that is related to the content or attributes of the search term,
wherein when the cumulative number of searches for the search term is between the first threshold and the second threshold, the evolution module optimizes a mid-term search process to further find the at least one related word in the text and/or the at least one historical search term in the historical search log that is most relevant to the content or attributes of the search term.
2. The system of claim 1, further comprising:
an index word list;
a text related word generation module configured to analyze, according to the index word list, the top M index words in the text that are most relevant to the search term, so as to generate a text related word list; and
a log related word generation module configured to analyze the degree of association between any two historical search terms in the historical search log and find the top N historical search terms most relevant to the search term, so as to generate a log related word list.
3. The system of claim 2, wherein the text related word generation module calculates an association strength according to the probabilities that the search term and the index words appear alone or together in the text.
4. The system of claim 2, wherein the log related word generation module calculates an association strength according to the probabilities that the search term and the content or attributes of the historical search terms appear alone or together in the historical search log.
5. The system of claim 2, further comprising:
a related word discrimination calculation module configured to calculate a discrimination value for each related word according to the index word list, the log related word list, and the text related word list; and
a related word recommendation module configured to compare the discrimination value of each related word in the log related word list with the discrimination value of each related word in the text related word list and, according to the ranking of the discrimination values of the related words, select the top P related words with the highest discrimination values from the text related word list and the log related word list.
6. The system of claim 5, wherein the related word discrimination calculation module calculates the discrimination value according to a degree of difference in how each related word appears in the text, the degree of difference being related to the frequency with which each related word appears in a single text or in a plurality of texts.
7. The system of claim 2, further comprising:
a word segmentation module configured to receive the search term, wherein when the search term does not exist in the text, the word segmentation module splits the search term into at least one index word according to the index word list.
8. The system of claim 5, further comprising:
a new word recognition module configured to identify whether the related word contains a new word that does not exist in the index word list, wherein when the related word contains the new word, the related word discrimination calculation module calculates the discrimination value of the related word according to the related word and the new word it contains.
9. The system of any one of claims 1 to 8, wherein the system is executed by a processor or by a software program loaded by the processor.
10. A method for adaptively adjusting related search terms, comprising:
an input process for receiving a search term;
a record collection process for determining whether a cumulative number of searches for the search term is greater than a first threshold or less than a second threshold;
a threshold setting process for setting the number of search logs that satisfies the first or the second threshold; and
an evolution process for adjusting a search process according to the number of search logs, wherein when the cumulative number of searches for the search term is greater than the first threshold, the evolution process finds, from a historical search log, at least one historical search term related to the content or attributes of the search term,
wherein when the cumulative number of searches for the search term is less than the second threshold, the evolution process executes an initial search process to find at least one related word in a text that is related to the content or attributes of the search term,
wherein when the cumulative number of searches for the search term is between the first threshold and the second threshold, the evolution process optimizes a mid-term search process to further find the at least one related word in the text and/or the at least one historical search term in the historical search log that is most relevant to the content or attributes of the search term.
11. The method of claim 10, further comprising:
establishing an index word list;
a text related word generation process for analyzing, according to the index word list, the top M index words in the text that are most relevant to the search term, so as to generate a text related word list; and
a log related word generation process for analyzing the degree of association between any two historical search terms in the historical search log and finding the top N historical search terms most relevant to the search term, so as to generate a log related word list.
12. The method of claim 11, wherein the text related word generation process calculates an association strength according to the probabilities that the search term and the index words appear alone or together in the text.
13. The method of claim 11, wherein the log related word generation process calculates an association strength according to the probabilities that the search term and the content or attributes of the historical search terms appear alone or together in the historical search log.
14. The method of claim 11, further comprising:
a related word discrimination calculation process for calculating a discrimination value for each related word according to the index word list, the log related word list, and the text related word list; and
a related word recommendation process for comparing the discrimination value of each related word in the log related word list with the discrimination value of each related word in the text related word list and, according to the ranking of the discrimination values of the related words, selecting the top P related words with the highest discrimination values from the text related word list and the log related word list.
15. The method of claim 14, wherein the related word discrimination calculation process calculates the discrimination value according to a degree of difference in how each related word appears in the text, the degree of difference being related to the frequency with which each related word appears in a single text or in a plurality of texts.
16. The method of claim 11, further comprising:
a word segmentation process for receiving the search term, wherein when the search term does not exist in the text, the word segmentation process splits the search term into at least one index word according to the index word list.
17. The method of claim 14, further comprising:
a new word recognition process for identifying whether the related word contains a new word that does not exist in the index word list, wherein when the related word contains the new word, the related word discrimination calculation process calculates the discrimination value of the related word according to the related word and the new word it contains.
18. The method of any one of claims 11 to 17, wherein the method is executed by a processor or by a software program loaded by the processor.
TW107145181A 2018-12-14 2018-12-14 System and method for adaptively adjusting related search words TWI681304B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
TW107145181A TWI681304B (en) 2018-12-14 2018-12-14 System and method for adaptively adjusting related search words
CN201910088844.9A CN111324705B (en) 2018-12-14 2019-01-29 System and method for adaptively adjusting associated search terms

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW107145181A TWI681304B (en) 2018-12-14 2018-12-14 System and method for adaptively adjusting related search words

Publications (2)

Publication Number Publication Date
TWI681304B true TWI681304B (en) 2020-01-01
TW202022635A TW202022635A (en) 2020-06-16

Family

ID=69942676

Family Applications (1)

Application Number Title Priority Date Filing Date
TW107145181A TWI681304B (en) 2018-12-14 2018-12-14 System and method for adaptively adjusting related search words

Country Status (2)

Country Link
CN (1) CN111324705B (en)
TW (1) TWI681304B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI755995B (en) * 2020-12-24 2022-02-21 科智企業股份有限公司 A method and a system for screening engineering data to obtain features, a method for screening engineering data repeatedly to obtain features, a method for generating predictive models, and a system for characterizing engineering data online
TWI787651B (en) * 2020-09-16 2022-12-21 洽吧智能股份有限公司 Method and system for labeling text segment

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI742446B (en) * 2019-10-08 2021-10-11 東方線上股份有限公司 Vocabulary library extension system and method thereof

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200921422A (en) * 2007-07-31 2009-05-16 Yahoo Inc System and method for determining semantically related terms
US20100262603A1 (en) * 2002-02-26 2010-10-14 Odom Paul S Search engine methods and systems for displaying relevant topics
CN102184173A (en) * 2009-10-31 2011-09-14 佛山市顺德区汉达精密电子科技有限公司 Method for searching Internet data
CN102629257A (en) * 2012-02-29 2012-08-08 南京大学 Commodity recommending method of e-commerce website based on keywords
CN103077179A (en) * 2011-09-12 2013-05-01 吉菲斯股份有限公司 A computer-implemented method for displaying an individual timeline of a user of a social network, computer system and computer readable medium thereof
TWI421713B (en) * 2008-02-28 2014-01-01 Yahoo Inc System and/or method for personalization of searches
CN105930376A (en) * 2016-04-12 2016-09-07 广东欧珀移动通信有限公司 Search method and device
US20170169111A1 (en) * 2015-12-09 2017-06-15 Oracle International Corporation Search query task management for search system tuning

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103365839B (en) * 2012-03-26 2017-12-12 深圳市世纪光速信息技术有限公司 The recommendation searching method and device of a kind of search engine
GB201418402D0 (en) * 2014-10-16 2014-12-03 Touchtype Ltd Text prediction integration
CN105653533B (en) * 2014-11-13 2019-10-25 腾讯数码(深圳)有限公司 A kind of method and apparatus updating classification associated set of words
CN106649334B (en) * 2015-10-29 2020-09-15 北京国双科技有限公司 Processing method and device of associated word set

Also Published As

Publication number Publication date
CN111324705A (en) 2020-06-23
TW202022635A (en) 2020-06-16
CN111324705B (en) 2023-05-02

CN114202443A (en) Policy classification method, device, equipment and storage medium