TWI232390B - Business search engine - Google Patents


Publication number
TWI232390B
Authority
TW
Taiwan
Prior art keywords
concept
search engine
search
business
enterprise
Prior art date
Application number
TW90101730A
Other languages
Chinese (zh)
Inventor
Patrick S Chen
Original Assignee
Nat Science Council
Priority date
Filing date
Publication date
Application filed by Nat Science Council
Priority to TW90101730A
Application granted
Publication of TWI232390B


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A business's information needs may be highly specific. Finding such specific information on the Internet is a new and important task that can be carried out effectively by the business search engine (BSE) introduced in this invention. Principles for constructing such a proprietary search engine for business are proposed, covering the system architecture of the BSE, the construction of concept databases, a suitable method for comparing search intentions with the contents of Web pages, and support for managing search results. Based on these principles, a workable system is developed with a rigorous methodology. Because a single keyword or a logical combination of keywords cannot represent a business's information need adequately and completely, the need is represented with a knowledge frame. Accordingly, a match algorithm based on the Feature Analysis Method (FAM) proposed by the applicant is developed to evaluate the relevance of a Web page to the business information need. After matching, relevant pages are ranked for user browsing, and interesting (relevant) Web sites are organized for handy reference. The system has a feedback mechanism: relevant pages are first verified by human experts and then analyzed by computer to extract new concepts, which are added to the concept database. Assessment of the search results shows that the system achieves high retrieval precision. It is therefore considered capable of carrying out retrieval tasks that find useful information to support business decision-making. The construction of a knowledge frame and the design of a match program implementing the FAM are characteristic of the system. Another feature of this invention is the automatic fetching of interesting Web pages for the business.

Description

Field of industrial application

The present invention discloses a novel method for matching Web page content against query concepts. The method raises the retrieval precision of an enterprise-specific search engine so that it can carry out the difficult data search tasks an enterprise assigns to it.

Background

General-purpose search engines mostly provide a user interface through which search concepts are entered into the system to query for the required data. The implementation techniques of known search engines fall roughly into four categories:

(1) Directory-based search engines, such as Yahoo and Yam (蕃薯藤).

(2) Index-based search engines, such as Infoseek and AltaVista.

(3) Metasearchers, such as MetaCrawler and SavvySearch.

(4) Intelligent searchers, such as Google or other personalized search engines.

These systems are largely suited to ad hoc search. After searching, however, users often find either no usable data or too much data to form useful information. Enterprises that want to use retrieved information for decision support are especially likely to find these tools inadequate.

Prior art

Many general-purpose search engines exist today, and they support ad hoc search for general data retrieval, but they often cannot effectively support special information needs.

Search engines currently use three main design methods:

(1) keyword-based design;

(2) directory-based design;

[...] In addition, the technical content of the [...] patent differs from that of the present invention. That patent describes a method for querying multiple search engines simultaneously over the network and falls into the metasearcher category. The present invention collects its data independently and does not rely on other search engines. This is the difference between the two.

Object of the invention

The present invention discloses a novel Business Search Engine (BSE). The system comprises a Web crawler, a match program, an enterprise-specific concept database, and a database of interesting Web sites/pages.

At present, an enterprise's information need is expressed with keywords or combinations of keywords. These are scattered fragments and cannot fully express the enterprise's search intention. The match programs of general-purpose search engines are designed to process keywords (and their combinations), so they often return either no data or too much data. One object of the present invention is therefore to represent the enterprise's data collection concept with a "knowledge frame", which is more complete and better organized than conventional keywords or keyword combinations and can properly express the enterprise's information need. Another object of the present invention is to disclose a "Feature Analysis Method", a suitable method for matching Web page content against query concepts.

Those skilled in the art will clearly understand the objects and advantages of the present invention after reading the following detailed description of the preferred embodiment shown in the drawings.
Detailed description of the invention

The Business Search Engine (BSE) system of the present invention comprises four main components: a Web crawler, a match program, an enterprise-specific concept database, and a database of interesting Web sites/pages. The Web crawler and the match program are basic components common to all search engines, whereas the enterprise-specific concept database expresses the enterprise's information need as a "knowledge frame". The match program of this system is built with the Feature Analysis Method from the material supplied by the enterprise.

The BSE system is shown in Figure 1. Its components are described below by the reference numbers used in the figure:

(1) Information need: the enterprise's special information need is automatically analyzed and abstracted (11) into query concepts. The automated method used is clustering, which finds several semantic fields that form the important facets of the need, each represented by a hypernym.

(2) Search concepts: an information need is expressed with the knowledge frame described above, and each hypernym forms an attribute of the frame. The information need expressed as a knowledge frame constitutes the query concept (also called the query basis) of the present invention.

(3) Internet/Web pages: the Web crawler retrieves the pages of a site one by one according to its Uniform Resource Locator (URL).

(4) Text processor: each *.html page undergoes preliminary processing to remove the tags.

(5) Match program: the match program scores the support of a Web page for the query concept according to the Feature Analysis Method. Searching and matching run automatically and continuously in the background.

(6) Retrieved pages: pages are sorted by their support for a particular query concept, and result reports are produced regularly for the user's reference.

(7) Interesting Web sites/pages: pages that the user judges relevant are stored in the database of interesting Web sites/pages.

(8) Knowledge base for search: after automatic analysis, the content of relevant pages may yield new concepts, which are added to the enterprise-specific concept database. The system thus keeps evolving, and the enterprise gains knowledge as a result.

The system operates as follows. When the enterprise's search concept is first formed, the enterprise can supply some relevant sample pages. The system automatically extracts the concept terms they contain and groups them into several categories by clustering. The elements of each category can be regarded as a list of hyponyms; each element (concept term) has its own weight, which can be determined from its frequency of occurrence. Each category represents one facet of the enterprise's data need and is called a hypernym. Each hypernym is also given a weight index, which can be assigned manually or automatically by the system. An automatically assigned weight can be determined from the proportion of hyponym members belonging to each hypernym, or computed with another mathematical formula (such as the Analytic Hierarchy Process, AHP).
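As a rough illustration of the automatic weighting described above, the following Python sketch derives hypernym weights from the share of hyponym members in each category. The function name `hypernym_weights` and the sample categories are hypothetical, and the clustering step that produces the categories is assumed to have run already; this is a sketch under those assumptions, not the applicant's implementation.

```python
def hypernym_weights(categories):
    """Weight each hypernym by its share of all hyponym members.

    `categories` maps a hypernym to the list of concept terms (hyponyms)
    that clustering placed under it. Names and data are illustrative only.
    """
    total = sum(len(terms) for terms in categories.values())
    if total == 0:
        raise ValueError("no concept terms to weight")
    return {name: len(terms) / total for name, terms in categories.items()}


# Hypothetical clustering result: three facets with 2, 1 and 1 members.
categories = {
    "media": ["CD", "VCD"],
    "transaction": ["order"],
    "quantity": ["one disc"],
}
weights = hypernym_weights(categories)
print(weights)
```

With weights built this way, the hypernym weights always sum to 1, which makes them easy to read as percentages. A scheme such as AHP would replace the body of the function while keeping the same output shape.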

[...] which is the basis of the Feature Analysis Method.

The novelty of the present invention lies in disclosing a system that collects data according to the special requirements of an enterprise. The interface provided by the system lets the enterprise turn its special information need into data collection commands. Data collection then becomes a routine task that requires no manual intervention.

The present invention has two distinguishing features. First, the system's query basis comes from a fairly complete and well-organized enterprise-specific concept database, and the query concept is represented as a "knowledge frame". Second, the match program is designed according to the "Feature Analysis Method" disclosed by the applicant. This way of querying differs greatly from general search engines, which rely on scattered keyword fragments entered by the user, and from keyword-based search engines in general.

The advantage of the present invention is shown by the test results: the retrieval precision of the Business Search Engine is quite high, so it can accomplish the difficult data search tasks assigned by the enterprise.

The problems that the present invention seeks to solve are as follows:

(1) The information needs of any enterprise, organization, or even an individual professional researcher are often specific and specialized, and information of a general kind receives less attention. For example, an information media company that wants to protect its rights may look on the Internet for pirated copies of its publications. A general search engine does not give satisfactory results; even searching by publication title is extremely laborious and time-consuming and finds only one or two instances at best.

(2) At present, people using a general search engine usually search by clicking on data items or entering keywords. Keywords, however, can hardly describe certain special information needs precisely, because the search concept represented by keywords (and their logical combinations) is often fragmentary and incomplete. For example, someone looking on the Internet for Web sites that sell pirated optical discs may be disappointed by a keyword search and come away with nothing. An appropriate way of representing knowledge is therefore needed.

(3) Since the search concept has its own special representation, the matching of Web page content also calls for a special method. A novel matching method must therefore be designed.

In summary, the present invention relates to a novel method for matching Web page content against query concepts; its technical scope covers programming, knowledge representation, and methods for evaluating Web page content.

An enterprise's information need is specific, and obtaining such information from the Internet requires a special collection method. Data collection in this application is automatic and requires no user operation. The input of the system is the knowledge in the enterprise-specific concept database, and the output is the relevant Web pages of greatest reference value. The present invention works as follows. The Web crawler retrieves a page from a site according to its Uniform Resource Locator (URL). Using the knowledge in the database as its basis, the Business Search Engine (BSE) analyzes how relevant the page is to the enterprise's information need and assigns it a weight. After the weight is recorded, the page is deleted, and the other pages of the site are then retrieved in turn. The method of analyzing page content is designed according to the "Feature Analysis Method" disclosed by the applicant. Searching and matching run continuously in the background, and result reports are produced regularly. In a result report, the system sorts the pages by weight for the enterprise's reference. Pages that the user judges relevant are stored in the database of interesting Web pages. After automatic analysis, the content of relevant pages may yield new concepts, which are added to the enterprise-specific concept database. The system thus keeps evolving, and the enterprise gains knowledge as a result.

The practical value of the present invention is that it can meet the special information needs of an enterprise. For example, an information media company may want to look for pirated copies of its publications on the Internet to protect its rights. Doing this with a general search engine is not easy; even when one or two instances can be found, the effort is extremely laborious and time-consuming. With the system designed in the present invention, by contrast, most of the work can be done almost fully automatically.

The Business Search Engine (BSE) design of the present invention has the following features:

(1) It helps the enterprise describe its special information need as thoroughly and completely as possible.

(2) It provides a method for analyzing the usability of data.

(3) It helps the enterprise manage the useful data collected so that they become the enterprise's intellectual assets; it records the addresses of relevant Web sites as completely as possible to facilitate later searches, and it analyzes the data obtained to form new information needs as a basis for subsequent searches.

Technical content, features, and effects

The enterprise-specific concept database equips the search engine with the specialized database required by the enterprise's special information need. The present invention describes this information need with the following knowledge frame:

    Object_frame = {
        attribute 1: a list of terms;
        attribute j: a list of terms;
        attribute N: a list of terms;
    }

The "a list of terms" in this knowledge frame forms a semantic field, and the hypernym of the field serves as the attribute.

(1) A (query) concept is weighted at two levels, the hypernym and the semantic field. The first-level weight is the relative importance of a hypernym to the query concept. The second-level weight, the semantic-field weight, is the sum of the weights of all concept terms in that semantic field.

(2) The Web crawler downloads the pages of the site one by one from the address (Uniform Resource Locator) specified by the user and passes them to the match program for weight evaluation.

The match program of the present invention is designed according to the "Feature Analysis Method" disclosed by the applicant. With the knowledge frame structure described above, the support of a Web page for the query concept can be measured by the page's cumulative weight. The cumulative weight is obtained by multiplying each hypernym's weight by the sum of the weights of the concept terms in its semantic field and adding up the products over all hypernyms.
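One way to picture the knowledge frame and its two-level weighting is as a small data structure. The Python sketch below is illustrative only: the class names `SemanticField` and `KnowledgeFrame`, the sample terms and weights, and the convention that hypernym weights sum to 1 are assumptions introduced here and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticField:
    """One attribute of the frame: a hypernym plus its weighted concept terms."""
    hypernym: str
    weight: float                               # first-level (hypernym) weight
    terms: dict = field(default_factory=dict)   # concept term -> term weight

    def term_weight_total(self):
        """Second-level weight: sum of the weights of all terms in the field."""
        return sum(self.terms.values())


@dataclass
class KnowledgeFrame:
    """A query concept expressed as a list of semantic fields."""
    name: str
    fields: list

    def check_hypernym_weights(self, tolerance=1e-9):
        """True when the hypernym weights sum to 1 (an assumed convention)."""
        return abs(sum(f.weight for f in self.fields) - 1.0) <= tolerance


# Hypothetical frame with two semantic fields.
frame = KnowledgeFrame(
    name="example query",
    fields=[
        SemanticField("media", 0.5, {"CD": 2.0, "VCD": 1.0}),
        SemanticField("quantity", 0.5, {"one disc": 1.5}),
    ],
)
print(frame.fields[0].term_weight_total())
print(frame.check_hypernym_weights())
```

Keeping the term weights inside each field, rather than in one flat keyword list, is what lets a match program weight a whole facet up or down without touching the individual terms.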

A page whose cumulative weight is at or above a given threshold is deemed relevant.

Algorithm for the cumulative weight

Suppose a query concept contains m semantic fields whose corresponding hypernyms have weights a(1), ..., a(m), and suppose semantic field i consists of n(i) concept terms. The relevance of a Web page to the query concept can then be obtained from its cumulative weight:

    (#1) cumulative weight = 0;  /* initial value */
    (#2) for i = 1 to m
    (#3)     weight of semantic field (i) = 0;  /* initial value */
    (#4)     for j = 1 to n(i)  /* compute the weight of the i-th semantic field */
    (#5)         weight of semantic field (i) = weight of semantic field (i) + weight of the j-th concept term of semantic field (i) * number of occurrences of the j-th concept term in the page;
    (#6)     next j;
    (#7)     cumulative weight = cumulative weight + a(i) * weight of semantic field (i);
    (#8) next i;
    (#9) output cumulative weight

(3) The relevance and address of a Web page are recorded. Once matching is complete, the page is deleted, and the Web crawler retrieves the next page and continues matching. Retrieval and matching can run in the background with search results reported periodically, so the user need not wait online. The system can even be started before leaving work in the evening and the results checked the next morning.

(4) When the system judges a page relevant, its address is automatically recorded in the database for later reference. The address of the Web site to which the page belongs is also recorded in the database as a basis for subsequent searches. Analysis of a page confirmed to be of reference value may yield new search concepts, which are added to the corpus to form a new search basis. In this way the system can keep evolving.

Embodiment

The following embodiment compares preliminary tests of the Business Search Engine system of the present invention with other general-purpose search engines.

Suppose the search target is Web pages that sell pirated optical discs. Experts (from the Criminal Investigation Bureau) supplied sample pages, and analysis yielded 879 concept terms in total. The ten most frequent concept terms are listed below in descending order of weight (weights in parentheses):

CD (10.19), 精選 "selected" (5.80), 片裝 "discs per pack" (5.79), 一片 "one disc" (3.46), 選輯 "compilation" (3.05), 中文版 "Chinese edition" (2.25), 大碟 "album" (2.15), 精選輯 "selected collection" (2.03), VCD (2.02), 光碟 "optical disc" (1.93).

Using these ten concept terms and their logical combinations to look for pages that sell pirated optical discs on a general-purpose, concept-term-based search engine returns dozens of "CD" companies and optical disc or music Web sites that have nothing to do with selling pirated discs, because terms such as CD, 精選, and 片裝 carry very high weights.
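The cumulative-weight procedure and the threshold test can be sketched in Python as follows. The function names, the list-of-pairs layout of the frame, the example.com addresses, and the tiny frame and pages are all hypothetical. Term occurrences are counted with plain substring matching, which is an assumption made for brevity; the patent does not say how occurrences are counted.

```python
def cumulative_weight(page_text, frame):
    """Score a page against a knowledge frame.

    `frame` is a list of (hypernym_weight, {concept_term: term_weight}) pairs,
    one pair per semantic field.
    """
    total = 0.0
    for hypernym_weight, terms in frame:
        field_weight = 0.0
        for term, term_weight in terms.items():
            # Term weight times the number of occurrences of the term in the page.
            field_weight += term_weight * page_text.count(term)
        total += hypernym_weight * field_weight
    return total


def rank_pages(pages, frame, threshold):
    """Return (url, score) pairs at or above the threshold, best first."""
    scored = [(url, cumulative_weight(text, frame)) for url, text in pages.items()]
    relevant = [item for item in scored if item[1] >= threshold]
    return sorted(relevant, key=lambda item: item[1], reverse=True)


# Hypothetical frame: two semantic fields with equal hypernym weights.
frame = [
    (0.5, {"VCD": 2.0, "burn": 1.0}),
    (0.5, {"order": 1.0, "price": 0.5}),
]
pages = {
    "http://example.com/a": "VCD VCD burn order price",
    "http://example.com/b": "music news",
    "http://example.com/c": "order price",
}
ranked = rank_pages(pages, frame, threshold=1.0)
print(ranked)
```

In this toy run only the first page clears the threshold: it scores in both semantic fields, while the third page matches transaction terms alone and stays below 1.0. Only the address and score need to be kept, so the page text can be discarded after scoring, as the description suggests.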

A retrieval test of this difficult task, finding pages related to "selling pirated optical discs", was run on a test set of about [...] Web pages. The methods were as follows:

(Group A): Retrieval based on occurrence frequency, using the 879 concept terms and their weights. At a cumulative weight of 1474, the detection rate was 58% and the precision was 0.76%; that is, about one in every [...] retrieved pages was related to selling pirated discs. This precision is too low for practical use.

(Group B): A search with the Business Search Engine system of the present invention. The Business Search Engine groups the 879 concept terms into four semantic fields, each associated with a hypernym. The concept terms and their weights (in parentheses) are listed below:

(1) Media technology: VCD (2.02), 燒錄 "burning" ([...]), 破解 "cracked" (0.25), CD (10.19), 大碟 "album" (2.15), 光碟 "optical disc" (1.93), 電影 "movie" ([...]), 原版 "original edition" (0.26), 音樂 "music" (0.20), 歌曲 "song" (0.32), ...

(2) Transaction: 註冊 "registration" (1.5), 電話 "telephone" (0.58), 訂購 "ordering" (0.34), 訂單 "order form" (0.33), 貨款 "payment" (0.29), 郵遞區號 "postal code" (0.25), 特價 "special price" (0.22), 網址 "Web address" (0.19), 價格 "price" (0.15), 劃撥 "postal transfer" (0.03), ...

(3) Advertising and marketing: 精選 "selected" (5.80), 選輯 "compilation" (3.05), 中文版 "Chinese edition" (2.25), 精選輯 "selected collection" (2.03), [...], 正式版 "official release" (1.71), 金片 "gold disc" (0.79), 最新 "latest" (0.67), 珍藏 "collector's item" (0.54), 泡麵 "instant noodles" (0.51), 補帖 "patch" (0.10), 便宜 "cheap" (0.03), ...

(4) Quantity: 片裝 "discs per pack" (5.79), 一片 "one disc" (3.46), 兩片 "two discs" (1.72), 一套 "one set" (0.27), 全套 "full set" (0.13), ...

The data of the two experiments follow.

(Group A) Without any expert knowledge, each of the four semantic fields was given an equal weight of 25%, and the pages were analyzed with the Feature Analysis Method. At a cumulative weight of 12.86, the detection rate was 42.37% and the precision was 8.72%; that is, about one in every 12 retrieved pages was related to selling pirated discs.

(Group B) After 30 new sample pages were fed into the system, that is, after feedback learning, the weights of the four semantic fields were adjusted to: media technology (22.53%), transaction (28.28%), advertising and marketing (30.83%), and quantity (18.36%). At a cumulative weight of 13.83, the detection rate was 47.54% and the precision was 13.72%; that is, about one in every 8 retrieved pages was related to selling pirated discs. Note that the weights of the individual concept terms also changed slightly during feedback learning.

For a search task this difficult and this dependent on intelligence, such precision is already sufficient to support practical use.

In summary, the present invention is original, novel, and inventive. Although the invention has been disclosed above by way of preferred embodiments, they are not intended to limit it. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention, so the scope of protection of the invention shall be as defined by the appended claims.

第24頁 1232390 圖式簡單說明 圖式說明: 圖一、企業專屬搜尋引擎之系統架構 5 圖號說明: 1 ......•資訊需求(Information Need) 2 ......•查詢概念(Search Concepts) 3 ......•網際網路/網頁(Internet,Web Pages) ίο 4......•文件處理器(Text Processor) 5 ......•比對程式(Match Program) 6 ......•搜尋到的網頁(Retrieved Pages) 7 ......•相關網站/網頁(Interesting Web Sites/ Data) 8 ......•資料搜集知識庫(Knowledge Base for Search) 15 11......抽象化 13…….由網路資料擷取器擷取網頁 15···.計算累積權重後排序 第25頁 1232390 圖式簡單說明 16……·網頁相關性檢驗 17……·萃取關鍵詞形成新查詢概念 5 10Page 24 1232390 Schematic description Schematic description: Figure 1. System architecture of enterprise-specific search engine 5 Description of drawing numbers: 1 ...... • Information Need 2 ...... • Query Concept (Search Concepts) 3 ...... • Internet / Web Pages (Internet, Web Pages) ίο 4 ...... • File Processor (Text Processor) 5 ...... • Comparison Program (Match Program) 6 ...... • Retrieved Pages 7 ...... • Interesting Web Sites / Data 8 ...... • Data Collection Knowledge Base for Search 15 11 ... Abstraction 13 ... Web pages retrieved by network data extractor 15 ... Sorted after calculating cumulative weights Page 25 1232390 Illustration 16 …… · Webpage relevance test 17 …… · Extract keywords to form a new query concept 5 10

Claims (1)

1. An enterprise-specific search engine system comprising four components: a Web Crawler, a Match Program, an enterprise-specific Concept Database, and a Database of Interesting Web Sites/Pages; characterized in that the enterprise-specific Concept Database expresses the information need of the enterprise by means of a "Knowledge Frame", and the Match Program is constructed according to the Feature Analysis Method.

2. The enterprise-specific search engine system of claim 1, wherein the Knowledge Frame describes the information need as follows:

    Object_frame = {
        attribute 1: a list of terms;
        attribute j: a list of terms;
        attribute N: a list of terms;
    }

Each attribute in this Knowledge Frame is a facet of the query and is a hypernym; each hypernym has a list of hyponyms.

3. The enterprise-specific search engine system of claim 1, wherein the Feature Analysis Method is the basis on which the Match Program is designed; it scores each web page according to how strongly the page supports the search concepts.

4. The enterprise-specific search engine system of claim 2, wherein the facets are obtained by an automated method, namely clustering, which identifies several semantic fields; these constitute the important facets of the information need, each represented by a hypernym.

5. The enterprise-specific search engine system of claim 3, wherein, in the Feature Analysis Method, the weight of each hyponym (concept word) may be assigned manually or assigned automatically by the system (for example, according to its frequency of occurrence); the weight of each hypernym may be determined by the proportion of hyponym members belonging to it, or computed by another mathematical formula.
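As an illustration of the Knowledge Frame and the automatic weighting options named in the claims, the following Python sketch represents a frame as a mapping from facet (hypernym) to concept words (hyponyms). It derives hypernym weights from the proportion of hyponym members and hyponym weights from raw occurrence counts. The function names, the whole-word matching rule, and the use of raw counts as weights are assumptions for illustration; the claims do not fix a formula. The sample terms are taken from the description.

```python
import re
from collections import Counter

# Hypothetical knowledge frame: facet (hypernym) -> concept words (hyponyms).
OBJECT_FRAME = {
    "media technology": ["VCD", "CD", "movie", "music", "song"],
    "transaction": ["phone", "order", "payment", "price"],
    "quantity": ["pack"],
}


def hypernym_weights(frame):
    """Weight each hypernym by its share of all hyponym members."""
    total = sum(len(terms) for terms in frame.values())
    return {facet: len(terms) / total for facet, terms in frame.items()}


def hyponym_weights_by_frequency(frame, sample_texts):
    """Weight each hyponym by how often it occurs in the sample texts.

    Assumption: whole-word, case-insensitive matching, so that "CD" is not
    counted inside "VCD"; the raw count is used directly as the weight.
    """
    counts = Counter()
    for text in sample_texts:
        lowered = text.lower()
        for terms in frame.values():
            for term in terms:
                pattern = r"\b" + re.escape(term.lower()) + r"\b"
                counts[term] += len(re.findall(pattern, lowered))
    return {
        facet: {term: counts[term] for term in terms}
        for facet, terms in frame.items()
    }
```

In practice the raw counts would probably be normalized or smoothed before being used as weights, and manually assigned weights could override them for individual concept words, as the claim allows.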
TW90101730A 2001-01-20 2001-01-20 Business search engine TWI232390B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW90101730A TWI232390B (en) 2001-01-20 2001-01-20 Business search engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW90101730A TWI232390B (en) 2001-01-20 2001-01-20 Business search engine

Publications (1)

Publication Number Publication Date
TWI232390B true TWI232390B (en) 2005-05-11

Family

ID=36320017

Family Applications (1)

Application Number Title Priority Date Filing Date
TW90101730A TWI232390B (en) 2001-01-20 2001-01-20 Business search engine

Country Status (1)

Country Link
TW (1) TWI232390B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI454945B (en) * 2010-08-10 2014-10-01 Brightedge Technologies Inc Search engine optimization at scale
TWI474196B (en) * 2007-04-02 2015-02-21 Microsoft Corp Search macro suggestions relevant to search queries
CN113190739A (en) * 2021-02-02 2021-07-30 北京比特易湃信息技术有限公司 System for self-defining business factors and suitable for user search query suggestion and spell check

Similar Documents

Publication Publication Date Title
US12001490B2 (en) Systems for and methods of finding relevant documents by analyzing tags
Seymour et al. History of search engines
JP5620913B2 (en) Document length as a static relevance feature for ranking search results
Wu et al. Query selection techniques for efficient crawling of structured web sources
US8380721B2 (en) System and method for context-based knowledge search, tagging, collaboration, management, and advertisement
US8260771B1 (en) Predictive selection of item attributes likely to be useful in refining a search
US8352467B1 (en) Search result ranking based on trust
US9497277B2 (en) Interest graph-powered search
US20040049514A1 (en) System and method of searching data utilizing automatic categorization
US20050222987A1 (en) Automated detection of associations between search criteria and item categories based on collective analysis of user activity data
WO2014144905A1 (en) Interest graph-powered feed
Hasan et al. Query suggestion for e-commerce sites
Ali et al. An overview of Web search evaluation methods
CN109918563A (en) A method of the book recommendation based on public data
US9330071B1 (en) Tag merging
Fang et al. Discriminative probabilistic models for expert search in heterogeneous information sources
Bellogín et al. Information retrieval and recommender systems
Malhotra et al. A comprehensive review from hyperlink to intelligent technologies based personalized search systems
TWI232390B (en) Business search engine
Li et al. People search: Searching people sharing similar interests from the Web
Munilatha et al. A study on issues and techniques of web mining
Wu et al. A quality analysis of keyword searching in different search engines projects
Zhu et al. An Integrated Information Retrieval Framework for Managing the Digital Web Ecosystem
Al-Akashi Using Wikipedia Knowledge and Query Types in a New Indexing Approach for Web Search Engines
Ji et al. Exsearch: a novel vertical search engine for online barter business

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees