TWI497322B - Method and system for determining and using webpage evaluation values - Google Patents

Publication number
TWI497322B
Authority
TW
Taiwan
Prior art keywords
webpage
evaluation value
same
search engine
engine server
Application number
TW098133414A
Other languages
Chinese (zh)
Other versions
TW201113726A (en)
Original Assignee
Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority to TW098133414A
Publication of TW201113726A
Application granted
Publication of TWI497322B

Description

Method and system for determining and using webpage evaluation values

The present invention relates to information processing technology, and in particular to a method and system for determining and using webpage evaluation values by computer.

A search engine crawls web pages from the Internet. When a user submits a query, it finds all pages that match the user's keywords and sorts them by relevance, so that the top-ranked results better match what the user needs. Because relevance is a complex quantity computed from many parameters, there are many technical solutions that compute it with different algorithms and parameters, and the parameters and algorithms generally differ from one search engine vendor to another.

For example, in 1997 Google proposed Page Rank, a parameter for improving relevance algorithms, together with an algorithm for computing it. Page Rank can be understood roughly as follows: a target page linked from an important page receives a significant weight, and the more important pages point to a page, the higher its Page Rank and the more important it is.

Existing search engines generally handle the ranking of content-oriented queries poorly, mainly in the following two respects:

1. Using external links to compute Page Rank as a way of identifying important pages is largely ineffective, so the results at the top of the list are often not the ones the user most wants to see.

2. Existing search engines usually apply deduplication to pages whose content is identical or nearly identical. For example, some duplicate pages are not stored when crawled, or, after a search request is received, some duplicate pages are not displayed or are placed near the end of the results. Without suitable link data, a search engine may, following the Page Rank algorithm, ignore or demote the original page and rank a reprinted copy ahead of it. Existing search engines therefore do not take into account the effect that different pages with the same content have on result ranking.

The present invention provides a method and system for determining and using webpage evaluation values by computer, in order to improve the accuracy of returned query results.

An embodiment of the present invention provides a method for determining a webpage evaluation value by computer, comprising the following steps: obtaining, from a search engine server system, web pages having identical or nearly identical content; determining, by the search engine server system, the generation time and a first evaluation value of each of the web pages; and determining, by the search engine server system, a second evaluation value of the web page with the earliest generation time according to the first evaluation values of the web pages.

Preferably, the web pages having identical or nearly identical content include web pages with the same digital fingerprint.

Preferably, obtaining the web pages having identical or nearly identical content comprises: taking, from each web page, the longest paragraph among the middle paragraphs (excluding the first and last paragraphs), or the longest sentence of a paragraph excluding its first and last sentences, and generating a digital fingerprint from it; and, after determining from the digital fingerprints whether the contents of the web pages are the same, obtaining the web pages having identical or nearly identical content.

Preferably, determining the generation time of each web page uses one or a combination of the following: the time contained in the page's Uniform Resource Locator (URL); the time given in the content page itself; the time at which the page was crawled; the time at which the page was first added to the index.

Preferably, the second evaluation value is greater than the first evaluation value.

Preferably, the second evaluation value is the sum of the first evaluation values of the web pages whose content is identical or nearly identical to that of the earliest-generated web page, multiplied by a first weighting coefficient, plus the first evaluation value of the earliest-generated web page multiplied by a second weighting coefficient.

Preferably, the first weighting coefficient and the second weighting coefficient have the same value or different values.

Preferably, the first evaluation value is an evaluation value formed from data that includes external links.

The present invention also provides a method for sorting search query results according to webpage evaluation values, comprising the following steps: obtaining query results from a search engine server system; and sorting, by the search engine server system, the query results according to the first evaluation value of each web page and the second evaluation value of the web page with the earliest generation time.

Preferably, the method further comprises: displaying, by the search engine server, the number of times each web page has been reprinted in the query results.

The present invention also provides a method for sorting web search query results by computer, comprising the following steps: obtaining, from a search engine server system, web pages having identical or nearly identical content; determining, by the search engine server system, the generation time of each of the web pages; and sorting, by the search engine server system, the web pages in order of their generation times.

Preferably, the method further comprises: sorting, by the search engine server system, according to the generation times of the web pages and external link data.

The present invention provides a search engine server system, comprising: a crawler system for obtaining web pages having identical or nearly identical content; and an indexing system for determining the generation time and the first evaluation value of each web page, and for determining a second evaluation value of the web page with the earliest generation time according to the first evaluation values of the web pages.

Preferably, the indexing system is further configured to determine from the digital fingerprints of the web pages whether they have identical or nearly identical content.

Preferably, the indexing system comprises: a digital fingerprint generating unit for taking, from each web page, the longest paragraph among the middle paragraphs (excluding the first and last paragraphs), or the longest sentence of a paragraph excluding its first and last sentences, and generating a digital fingerprint from it; a comparing unit for determining from the digital fingerprints whether the contents of the web pages are the same; and an obtaining unit for obtaining, after that determination, the web pages having identical or nearly identical content.

Preferably, the indexing system is further configured to determine the generation time of a web page from one or a combination of the following: the time contained in the page's Uniform Resource Locator (URL); the time given in the content page itself; the time at which the page was crawled; the time at which the page was first added to the index.

Preferably, when determining the second evaluation value of the earliest-generated web page according to the first evaluation values of the web pages, the indexing system is further configured to set the second evaluation value to the sum of the first evaluation values of the web pages whose content is identical or nearly identical to that of the earliest-generated web page, multiplied by a first weighting coefficient, plus the first evaluation value of the earliest-generated web page multiplied by a second weighting coefficient.

Preferably, the indexing system is further configured to determine the first evaluation value of each web page from an evaluation value formed from data that includes links from other web pages.

Preferably, the indexing system is also configured to sort query results according to the first and second evaluation values of the web pages.

Preferably, the indexing system is further configured to display the number of times each web page has been reprinted in the query results.

The present invention also provides a search engine server system, comprising: a crawler system for obtaining, from the search engine server system, web pages having identical or nearly identical content; and a search engine server for determining the generation time of each of the web pages and sorting them in order of their generation times.

Preferably, the search engine server is further configured to sort according to the generation times of the web pages and external link data.

The advantageous effects of the present invention are as follows. In implementing the invention, web pages having identical or nearly identical content are first obtained; the generation time and evaluation value of each page are then determined; and finally the evaluation value of the earliest-generated page is determined from the evaluation values of the pages.

Because the generation time is used to judge whether a page is the original, the scheme determines a page's actual evaluation value on the basis of generation time. It thereby overcomes the problem that, when Page Rank computed from external links is used to identify important pages, the pages at the top of the search results do not reflect their true evaluation values.

Furthermore, the scheme makes full use of the relationship between the evaluation values of different pages with the same content and applies it to improve the ranking of search results, which improves the accuracy of returned query results.

Specific embodiments of the present invention are described below with reference to the accompanying drawings.

In the course of the invention, the inventor observed the following:

1. Content pages usually have few external links. Using external links to compute Page Rank as a way of identifying important pages is therefore largely ineffective, so the results at the top of the list are often not the ones the user most wants to see.

2. Search engines treat different pages with the same content as a negative factor that interferes with search results: such pages are either discarded outright or given a very low Page Rank. In fact, these pages play a very important role in improving the ranking of search results.

In view of this, the present invention proposes a technical solution that adds a new and important ranking parameter to search engine result ranking and substantially improves the search results for content-oriented queries, so that users looking for articles through web search are considerably more satisfied with the results. The determination of the webpage evaluation value is described first, followed by the application of that value to returned query results in order to improve search accuracy.

In the implementation, borrowing the concept of Page Rank, Google's evaluation value for the importance of a web page, the evaluation value of a web page in the present invention is called Copy Rank. It denotes a parameter for improving search engine relevance ranking together with the algorithm that produces it, and it is suited to optimizing the ranking of results for content-oriented queries. It uses the number of times an article has been reprinted on the Internet to compute the Copy Rank of the original page, and it aggregates the reprinted pages. When computing relevance, the search engine combines traditional relevance parameters such as Page Rank and keyword match with Copy Rank to compute a new relevance value. When displaying results, the search engine also shows the number of reprints, to help the user quickly identify the best result on the Internet for the query.

Figure 1 is a schematic diagram of the effect of Copy Rank on search engine results. As the search results in the figure show, the more versions (reprints) an article has, the more likely it is to be the article the user wants to see.

Determining Copy Rank involves three main factors: first, judging whether the contents of web pages are essentially the same; second, determining the actual publication time of each page; third, determining which page is the original. These are described below.

Figure 2 is a flowchart of the method for determining the webpage evaluation value. As shown in the figure, determining the evaluation value may include the following steps: Step 201: obtain, from the search engine server system, web pages having identical or nearly identical content. Step 202: the search engine server system determines the generation time and first evaluation value of each web page. Step 203: the search engine server system determines the second evaluation value of the earliest-generated web page according to the first evaluation values of the web pages having identical or nearly identical content.

In the implementation, the web pages having identical or nearly identical content in step 201 include web pages with the same digital fingerprint.

Obtaining the web pages having identical or nearly identical content may then include: taking from the search engine server, for each web page, the longest paragraph among the middle paragraphs (excluding the first and last paragraphs), or the longest sentence of a paragraph excluding its first and last sentences, and generating its MD5 value; and, after determining from the digital fingerprints whether the page contents are the same, obtaining the web pages having identical or nearly identical content.

MD5 is short for Message-Digest Algorithm 5. It is widely used in cryptographic applications and can be regarded as the "digital fingerprint" of a file. Any file, whether an executable program, an image file, a temporary file, or any other type of file, and whatever its size, has a single MD5 value that is, for practical purposes, unique to its content, and if the file is modified its MD5 value changes as well. In the implementation, MD5 can therefore be used to determine whether web pages have identical or nearly identical content, in the same way that comparing the MD5 values of a file verifies whether the file has been tampered with. A typical use of MD5 is this: after downloading a file, one can run an MD5 check on it to find out whether it matches the original file on the website. If the computed MD5 value equals the one published by the website, the downloaded file is complete. If it differs, the downloaded file is not intact: either an error occurred during the download or the file has been modified. Reputable sites generally publish MD5 checksums for their files.

To judge whether page contents are the same, one approach is to find, in every article page, the longest paragraph among the middle paragraphs (excluding the first and last paragraphs) and generate its MD5 value as the page fingerprint, which serves as the basis for judging sameness. For an article with no more than two paragraphs, the longest sentence of a paragraph, excluding its first and last sentences, is taken instead, and its MD5 value is used as the page fingerprint. If two pages have the same page fingerprint, their entire contents are taken to be the same.

In the specific implementation, the page fingerprint is generated from the longest middle paragraph (excluding the first and last paragraphs), or from the longest sentence excluding the first and last sentences, because the inventor observed that the first and last paragraphs and the first and last sentences are usually modified very often and do not represent the real content of the article.
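The fingerprinting rule described above can be sketched in Python as follows. This is only an illustrative sketch, not the patent's implementation: the function names `split_sentences` and `page_fingerprint`, the sentence-splitting regular expression, and the choice to pool the interior sentences of all paragraphs for short articles are assumptions. Only `hashlib.md5` corresponds directly to the MD5 step named in the text.

```python
import hashlib
import re
from typing import List, Optional


def split_sentences(paragraph: str) -> List[str]:
    # Assumption: sentences end with common Chinese or Western terminators.
    parts = re.split(r"(?<=[。！？.!?])\s*", paragraph)
    return [p.strip() for p in parts if p.strip()]


def page_fingerprint(paragraphs: List[str]) -> Optional[str]:
    """Return an MD5 hex digest used as the page fingerprint, or None."""
    paragraphs = [p.strip() for p in paragraphs if p.strip()]
    if len(paragraphs) > 2:
        # Longest paragraph that is neither the first nor the last.
        candidates = paragraphs[1:-1]
    else:
        # Short article: longest sentence that is neither the first nor
        # the last sentence of its paragraph (pooled over paragraphs,
        # which is an assumption of this sketch).
        candidates = []
        for paragraph in paragraphs:
            candidates.extend(split_sentences(paragraph)[1:-1])
    if not candidates:
        return None  # nothing usable to fingerprint
    longest = max(candidates, key=len)
    return hashlib.md5(longest.encode("utf-8")).hexdigest()
```

Two pages whose first and last paragraphs differ but whose longest middle paragraph is the same receive the same fingerprint, which is the behavior the text relies on to group reprints.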

In the implementation, MD5 can be used to judge whether two files are the same. As those skilled in the art will readily appreciate, obtaining web pages with identical or nearly identical content in step 201 is not limited to comparing content by MD5; any other technical means that can determine whether two pages have the same content may be used. The ultimate aim is that, when different pages with the same content exist, they are not simply discarded as a negative factor interfering with search results but are instead used to improve the ranking of search results.

In step 202, determining the generation time of a web page may include determining it from the time contained in the page's Uniform Resource Locator (URL) and/or from the time given in the article page.

In the implementation, the actual publication time of a web page can be extracted by a computer program. Because the pages of most websites are now generated dynamically, the Last-Modified field returned by the web server is no longer meaningful, so the time is extracted from places such as the page body instead. The extraction may proceed as follows. First, check whether the URL (Uniform Resource Locator) contains a time; the URL in the following example does: http://news.sina.com.cn/w/2009-01-15/184017052431.shtml. A program can then extract 2009-01-15 from it. Specifically, the extraction may include: A. enumerating commonly used time formats and building a time format dimension table to store them; B. splitting the URL at its separators; C. looking up each resulting part in the time format dimension table; if a part matches a time format in the table, the URL contains a time and that time can be extracted.
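Steps A to C can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the table `TIME_FORMATS`, its three entries, the separator set, and the function name `date_from_url` are illustrative and do not come from the patent. Each table entry pairs a strict pattern with a `datetime.strptime` format so that arbitrary numeric IDs are not mistaken for dates.

```python
import re
from datetime import datetime
from typing import Optional

# Step A: a small "time format dimension table" (illustrative entries only).
TIME_FORMATS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "%Y-%m-%d"),
    (re.compile(r"\d{4}_\d{2}_\d{2}"), "%Y_%m_%d"),
    (re.compile(r"\d{8}"), "%Y%m%d"),
]


def date_from_url(url: str) -> Optional[datetime]:
    # Step B: split the URL at its separators (assumed separator set).
    for part in re.split(r"[/?&=.#]", url):
        # Step C: look each part up in the format table.
        for pattern, fmt in TIME_FORMATS:
            if pattern.fullmatch(part):
                try:
                    return datetime.strptime(part, fmt)
                except ValueError:
                    continue  # right shape, but not a valid calendar date
    return None
```

For the example URL in the text, the part `2009-01-15` matches the first table entry, while the 12-digit part `184017052431` matches none of them.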

If the URL contains no time, the time is taken from the article body. Times in article bodies come in many formats; as long as the program accounts for the formats that occur in practice, the time can be found quickly. In the following examples the article body contains a time:

2009年01月15日18:40 中國網 (January 15, 2009, 18:40, China Net)

2009年12月27日23:35 (December 27, 2009, 23:35)

A program can easily extract the time 2009年12月27日23:35 (December 27, 2009, 23:35).
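As an illustration of extracting such a time from the article body, the sketch below uses one regular expression for the 年/月/日 format shown in the examples. The pattern `CJK_DATETIME` and the function name `date_from_text` are assumptions; a real system would need a larger set of patterns to cover the many formats the text mentions.

```python
import re
from datetime import datetime
from typing import Optional

# Matches e.g. "2009年12月27日23:35"; the time-of-day part is optional.
CJK_DATETIME = re.compile(
    r"(\d{4})年(\d{1,2})月(\d{1,2})日\s*(?:(\d{1,2}):(\d{2}))?"
)


def date_from_text(text: str) -> Optional[datetime]:
    match = CJK_DATETIME.search(text)
    if match is None:
        return None
    year, month, day = int(match[1]), int(match[2]), int(match[3])
    hour = int(match[4]) if match[4] else 0
    minute = int(match[5]) if match[5] else 0
    try:
        return datetime(year, month, day, hour, minute)
    except ValueError:
        return None  # digits matched but do not form a valid date
```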

In the implementation, the time may be obtained by any programmatic means, such as analyzing the code of the various time and date formats in the page and matching them with regular expressions. If the program cannot determine the generation time, the current crawl time is used as the generation time. However the generation time is computed, its purpose is to identify the original version among the reprinted pages.

In the implementation, the generation time can be determined when the page is crawled and its index entry is built, and stored in a field (FIELD) of the page index.

In the implementation, when the generation time cannot be extracted from the article or the URL, the time at which the page was crawled may be used as the generation time, or the time at which the page was first added to the index may be assumed to be the article's generation time.
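The fallback order described above might be expressed as a small helper. The function name `resolve_generation_time` and its parameters are hypothetical, and preferring the first-indexed time over the current crawl time when both are available is an assumption of this sketch; the text only says that either may be used.

```python
from datetime import datetime
from typing import Optional


def resolve_generation_time(
    url_time: Optional[datetime],
    body_time: Optional[datetime],
    crawl_time: datetime,
    first_indexed_time: Optional[datetime] = None,
) -> datetime:
    # Prefer a time found in the URL, then one found in the article body.
    # Assumption: if neither exists, prefer the first-indexed time (when
    # known) over the current crawl time.
    for candidate in (url_time, body_time, first_indexed_time):
        if candidate is not None:
            return candidate
    return crawl_time
```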

Once the pages with the same content and their generation times have been determined in this way, the original page can be identified: among all the identical pages, the one with the earliest actual publication time is the original.
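A minimal sketch of this selection, assuming each page already carries a fingerprint and a generation time: pages are grouped by fingerprint and the earliest page in each group is treated as the original. The `Page` class and `find_originals` function are hypothetical names introduced for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List


@dataclass
class Page:
    url: str
    fingerprint: str        # e.g. the MD5-based page fingerprint
    generated_at: datetime  # generation (publication) time


def find_originals(pages: Iterable[Page]) -> Dict[str, Page]:
    """Map each fingerprint to the earliest-generated page that has it."""
    groups: Dict[str, List[Page]] = defaultdict(list)
    for page in pages:
        groups[page.fingerprint].append(page)
    return {
        fingerprint: min(group, key=lambda p: p.generated_at)
        for fingerprint, group in groups.items()
    }
```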

The evaluation values in step 202 are described next.

Page Rank is explained first, to give a better understanding of Copy Rank as defined in the present invention. Page Rank is Google's evaluation value for the importance of an individual web page; it is a Page Rank, not a "Site Rank", and it is not an evaluation of a whole website. If the home page of a website has a Page Rank of 5, this means only that the home page itself has a Page Rank of 5, not that the whole site does. Google's Page Rank applies to individual pages, not to websites.

The Page Rank value of a page derives mainly from the pages behind all the links that point to it. "All the links" has two parts: external links from outside the website, and internal links from other pages within the website. In other words, the Page Rank of any page results from external and internal links acting together, not from either kind alone. For example, the home page of a website might have a Page Rank of 5 because two external links from pages with a Page Rank of 5 point to it and, in addition, many internal links point to it.

By the same reasoning, in the implementation of the present invention, when step 203 determines the evaluation value of the earliest-generated web page from the evaluation values of the web pages with identical or nearly identical content, the second evaluation value can be set to the sum of the first evaluation values of the web pages whose content is identical or nearly identical to that of the earliest-generated page, multiplied by the first weighting coefficient, plus the first evaluation value of the earliest-generated page multiplied by the second weighting coefficient.

That is, Copy Rank lets the original page receive the weight of all its reprinted pages. Copy Rank can be computed by the following formula:

Copy Rank of the original page = Σ (Page Rank of each reprinted page) × w1 + (Page Rank of the original page) × w2, where w1 and w2 are weighting coefficients whose values can be set as needed in the implementation and may be the same or different.
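The formula translates directly into code. In the sketch below the function name `copy_rank` and the default weights of 1.0 are illustrative assumptions; the text leaves the values of w1 and w2 to the implementer.

```python
from typing import Iterable


def copy_rank(
    original_page_rank: float,
    reprint_page_ranks: Iterable[float],
    w1: float = 1.0,  # weight on the summed Page Rank of the reprints
    w2: float = 1.0,  # weight on the original page's own Page Rank
) -> float:
    """Copy Rank = w1 * sum(reprint Page Ranks) + w2 * original Page Rank."""
    return w1 * sum(reprint_page_ranks) + w2 * original_page_rank
```

For instance, an original page with Page Rank 5 that has two reprints with Page Ranks 2 and 3 gets a Copy Rank of 10 when both weights are 1.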

Note that Page Rank is used in the embodiments to illustrate the evaluation value, but in fact any evaluation value formed from data that includes links from other web pages can be used.

In addition, in the implementation Copy Rank can be generated once a page has been crawled, or the Copy Rank of all pages can be updated periodically.

In the implementation, a website blacklist and/or whitelist may also be built from historical data or experience for use in determining the original page: pages from whitelisted websites are assumed to be original, and pages from blacklisted websites are assumed not to be.
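One possible way to combine the lists with the earliest-time rule is sketched below. The text does not say how the two interact, so the precedence used here (whitelisted hosts first, blacklisted hosts excluded, earliest generation time as the tie-breaker) and the function name `pick_original` are assumptions. Hosts are obtained with the standard `urllib.parse.urlparse`.

```python
from datetime import datetime
from typing import AbstractSet, List, Tuple
from urllib.parse import urlparse

PageInfo = Tuple[str, datetime]  # (url, generation time)


def pick_original(
    pages: List[PageInfo],
    whitelist: AbstractSet[str] = frozenset(),
    blacklist: AbstractSet[str] = frozenset(),
) -> PageInfo:
    """Pick the presumed original among pages sharing one fingerprint.

    `pages` must be non-empty.
    """
    def host(url: str) -> str:
        return urlparse(url).hostname or ""

    trusted = [p for p in pages if host(p[0]) in whitelist]
    if trusted:
        # Assumption: a whitelisted page wins even if it is not the earliest.
        return min(trusted, key=lambda p: p[1])
    allowed = [p for p in pages if host(p[0]) not in blacklist]
    # If every page is blacklisted, fall back to the plain earliest-time rule.
    return min(allowed or pages, key=lambda p: p[1])
```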

Figure 3 is a schematic diagram of the relationship between reprinted pages and the Copy Rank of the original page. As shown in the figure, the evaluation value weight that externally linking pages give to all the reprinted pages is passed in full to the original page; seen from outside, it is as if all those external links gave their evaluation value to the original page.

Figure 4 is a flowchart of the method for sorting query results according to webpage evaluation values. As shown in the figure, applying the webpage evaluation value to returned query results in order to improve search accuracy may include the following steps: Step 401: obtain the query results from the search engine server system. Step 402: the search engine server system obtains, from the query results, the web pages having identical or nearly identical content. Step 403: the search engine server system determines the generation time and first evaluation value of each web page. Step 404: the search engine server system determines the second evaluation value of the earliest-generated web page according to the first evaluation values of the web pages. Step 405: sort the query results according to the first and second evaluation values of the web pages.

In step 405, the search engine server system can sort the retrieved webpages by evaluation value, for example ordering them by the magnitude of the evaluation value and then returning and displaying them in that order to the user who issued the query.

Further, the search engine server system may also display the number of times each webpage has been reprinted in the query results.

Based on the same inventive concept, embodiments of the invention also provide a search engine server system. Because the system solves the problem on the same principle as the method for determining webpage evaluation values and the method for returning query results according to those values, the implementation of the system can follow that of the methods, and the overlapping details are not repeated.

FIG. 5 is a schematic structural diagram of the search engine server system. As shown, the search engine server system may include: a crawler system 501, configured to obtain webpages having identical or nearly identical content; and an index system 502, configured to determine the generation time and the first evaluation value of each webpage, and to determine, from the first evaluation values of the webpages, the second evaluation value of the webpage with the earliest generation time.

In an implementation, the index system may further be configured to determine from the MD5 of each webpage whether the webpages have identical or nearly identical content.

The webpage acquisition module may include: an MD5 generating unit, configured to take from each webpage the longest paragraph of the middle content (that is, excluding the first and the last paragraph), or the longest sentence of that paragraph other than its first and last sentence, and to generate an MD5 from it; a comparing unit, configured to determine from the MD5 values whether the content of the webpages is the same; and an obtaining unit, configured to obtain the webpages having identical or nearly identical content once the MD5 comparison has been made.
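
The MD5 generating, comparing, and obtaining units can be sketched as below. The sketch assumes the page text has already been extracted and that paragraphs are separated by blank lines; the function names are hypothetical, and only the paragraph variant (not the longest-sentence variant) is shown.

```python
import hashlib


def middle_fingerprint(text):
    """MD5 of the longest paragraph that is neither the first nor the last.

    Returns None when the text has fewer than three paragraphs, because
    there is then no middle content to fingerprint.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    middle = paragraphs[1:-1]
    if not middle:
        return None
    longest = max(middle, key=len)
    return hashlib.md5(longest.encode("utf-8")).hexdigest()


def group_by_fingerprint(pages):
    """Map each fingerprint to the URLs that share it (pages: url -> text)."""
    groups = {}
    for url, text in pages.items():
        fp = middle_fingerprint(text)
        if fp is not None:
            groups.setdefault(fp, []).append(url)
    return groups
```

Skipping the first and last paragraphs makes the fingerprint less sensitive to site-specific headers, bylines, and footers that a reprinting site typically changes.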

In an implementation, the index system may further be configured to determine the generation time of a webpage from the time contained in the webpage URL and/or from the time shown in a content-type webpage.
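
As an illustration of reading a generation time out of a URL, the sketch below looks for a `YYYY/MM/DD`, `YYYY-MM-DD`, or `YYYYMMDD` pattern. The regular expression, the function names, and the preference for the URL date over a date found in the page body are assumptions; the text does not state which source wins when both are available.

```python
import re
from datetime import date

# Four-digit year starting with 19 or 20, then month and day, with optional separators.
_URL_DATE = re.compile(r"(?<!\d)((?:19|20)\d{2})[/_-]?(\d{2})[/_-]?(\d{2})(?!\d)")


def url_date(url):
    """Return the first valid calendar date embedded in the URL, or None."""
    for match in _URL_DATE.finditer(url):
        year, month, day = (int(g) for g in match.groups())
        try:
            return date(year, month, day)
        except ValueError:
            continue  # e.g. month 13; keep looking for a later match
    return None


def generation_date(url, page_date=None):
    """Prefer the date in the URL, otherwise the date found in the page content."""
    return url_date(url) or page_date
```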

In an implementation, when determining the second evaluation value of the webpage with the earliest generation time from the first evaluation values of the webpages, the index system may further set the second evaluation value to the product of the first weighting coefficient and the sum of the first evaluation values of the webpages whose content is identical or nearly identical to that of the earliest webpage, plus the product of the second weighting coefficient and the first evaluation value of the earliest webpage.
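
Written out, the rule above is: second value = w1 × (sum of the copies' first values) + w2 × (the original's first value). A minimal sketch follows; the function and parameter names and the default coefficients of 1.0 are assumptions, since the text leaves the coefficient values open.

```python
def second_evaluation_value(original_first, copy_firsts, w1=1.0, w2=1.0):
    """Second evaluation value of the earliest (original) page.

    original_first: first evaluation value of the earliest page
    copy_firsts:    first evaluation values of the pages with the same content
    w1, w2:         first and second weighting coefficients (may be equal or not)
    """
    return w1 * sum(copy_firsts) + w2 * original_first
```

For example, an original with a first value of 2.0 and two copies with first values 1.0 and 0.5, using w1 = 0.5 and w2 = 1.0, gets 0.5 × 1.5 + 2.0 = 2.75.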

In an implementation, the index system may further be configured to determine the first evaluation value of each webpage from an evaluation value formed from data that includes external links.

The index system may also be configured to sort the query results according to the first evaluation value and the second evaluation value of each webpage.

In an implementation, the index system may further be configured to display the number of times each webpage has been reprinted in the query results.

FIG. 6 is a schematic diagram of the operating environment of the search engine server system. As shown, the network 600 includes an index system 601 that sorts query results according to webpage evaluation values; a webpage 602 (standing for the various entities that produce webpages, which in practice may take the form of a server or the like; the embodiment uses "webpage" for such entities only for convenience of description, and although there may be many of them, the figure shows only one); a client 603 (only one is shown); a crawler system 604; and a query system 605.

The figure also shows that the index system 601 and the crawler system 604 together constitute the search engine server system. Note that some functional entities in the figure are connected through the network and others by direct connections drawn as straight lines. The figure is only schematic, however. In an actual implementation the network architecture can be arranged as needed; for example, the crawler system and the index system may be connected through the Internet rather than a local area network. Any connection that allows the entities to exchange data can be used to implement the invention.

In an implementation, the webpage 602 provides various webpage content, and the crawler system 604 can collect webpage information across the network and store it on one or more servers. The index system 601 of the invention builds an index from the collected webpage information so that query requests can be processed quickly. The index system 601 can also determine the first and second evaluation values of webpages and sort the webpages according to those values. The sorting may be performed immediately after the crawler system collects the webpage information, or only after a query request is received from a client; the invention places no limitation on this.

When the client 603 submits a query to the query system 605 through the network, the query system 605 can return the information the client 603 needs according to the sorting results of the sorting device 601 (that is, the index system 601), so that the query results the user receives are accurately ordered and truly reflect the relationships among them.

As the above embodiments show, the invention uses the number of times content has been reprinted and a Copy Rank value calculated from those reprints. Copy Rank is a parameter that can improve a search engine's relevance ranking and is suited to optimizing the ordering of search results for content-type queries. The number of times an article has been reprinted on the Internet can be used to calculate the Copy Rank of the original webpage, and the reprinted webpages can be aggregated. When the search engine calculates relevance, it can therefore combine Copy Rank with the traditional relevance parameters, such as evaluation values formed from data including external links (PageRank, for example) and the degree of keyword match, to compute a new relevance value. When the search engine displays results it also shows the number of reprints, helping users judge as quickly as possible which result on the Internet best matches the query, and thereby improving the accuracy of the returned results. As those skilled in the art will readily appreciate, search engines include webpage search engines, image search engines, software search engines, and so on. The technical solution of the invention can improve the accuracy of search engine results, both in its effect on the order of the search results (results reprinted more often rank higher) and in its effect on the results interface (the number of reprints is shown on the results page, and original content is displayed preferentially).

Embodiments of the invention add a new and important ranking parameter to the ordering of search engine results. This substantially improves the quality of search results for content-type queries and substantially raises user satisfaction with the results when searching the web for articles.

For convenience of description, the parts of the apparatus described above are divided by function into modules or units and described separately. When implementing the invention, the functions of the modules or units may of course be implemented in one or more pieces of software or hardware.

FIG. 7 is a flowchart of a method for sorting webpage search query results by computer. As shown, the sorting may include the following steps: Step 701, obtaining from the search engine server system the webpages having identical or nearly identical content; Step 702, the search engine server system determines the generation time of each of the webpages; Step 703, the search engine server system sorts the webpages in order of their generation times.

Further, the method may also include: the search engine server system sorts the webpages according to both their generation times and their external link data.
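
A minimal sketch of steps 701 to 703 with the optional external link data is given below. How generation time and link data are combined is not specified in the text; the sketch assumes the earliest page comes first and that inbound link count is used only to break ties, and the tuple layout and function name are hypothetical.

```python
from datetime import date


def sort_by_generation_time(pages):
    """Sort same-content pages, earliest first.

    pages: list of (url, created, inbound_links) tuples.
    Assumption: pages created on the same date are ordered by
    inbound link count, highest first.
    """
    return sorted(pages, key=lambda p: (p[1], -p[2]))
```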

FIG. 8 is a schematic structural diagram of a search engine server system. As shown, it includes: a crawler system 801, configured to obtain webpages having identical or nearly identical content; and a search engine server 802, configured to determine the generation time of each of the webpages and to sort the webpages in order of their generation times.

The search engine server 802 may further be configured to sort the webpages according to both their generation times and their external link data.

Those skilled in the art will understand that embodiments of the invention may be provided as a method, a system, or a computer program product. The invention may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The invention is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It should be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the invention have been described, those skilled in the art, once they learn of the basic inventive concept, may make additional changes and modifications to these embodiments. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.

Clearly, those skilled in the art can make various modifications and variations to the invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.

501‧‧‧Crawler system

502‧‧‧Index system

601‧‧‧Index system

602‧‧‧Webpage

603‧‧‧Client

604‧‧‧Crawler system

605‧‧‧Query system

801‧‧‧Crawler system

802‧‧‧Search engine server

FIG. 1 is a schematic diagram of the effect of Copy Rank on search engine results in an embodiment of the invention;

FIG. 2 is a flowchart of the method for determining a webpage evaluation value in an embodiment of the invention;
FIG. 3 is a schematic diagram of the Copy Rank relationship between reprinted webpages and the original webpage in an embodiment of the invention;
FIG. 4 is a flowchart of the method for sorting query results according to webpage evaluation values in an embodiment of the invention;
FIG. 5 is a schematic structural diagram of the search engine server system in an embodiment of the invention;
FIG. 6 is a schematic diagram of the operating environment of the search engine server system in an embodiment of the invention;
FIG. 7 is a flowchart of the method for sorting webpage search query results by computer in an embodiment of the invention;
FIG. 8 is a schematic structural diagram of a search engine server system in an embodiment of the invention.

Claims (16)

1. A method for determining a webpage evaluation value by computer, characterized by comprising the following steps: obtaining, from a search engine server system, webpages having identical or nearly identical content; determining, by the search engine server system, the generation time and a first evaluation value of each of the webpages; and determining, by the search engine server system and according to the first evaluation values of the webpages, a second evaluation value of the webpage with the earliest generation time, wherein the second evaluation value is the product of a first weighting coefficient and the sum of the first evaluation values of the webpages having identical or nearly identical content to the webpage with the earliest generation time, plus the product of a second weighting coefficient and the first evaluation value of the webpage with the earliest generation time.

2. The method of claim 1, wherein the webpages having identical or nearly identical content include webpages having the same digital fingerprint.

3. The method of claim 2, wherein obtaining the webpages having identical or nearly identical content comprises: taking from each webpage the longest paragraph of the middle content that is neither the first nor the last paragraph, or the longest sentence of that paragraph that is neither its first nor its last sentence, and generating a digital fingerprint from it; and, after determining from whether the digital fingerprints are the same whether the content of the webpages is the same, obtaining the webpages having identical or nearly identical content.

4. The method of claim 1, wherein determining the generation time of each of the webpages comprises one or a combination of the following: determining it from the time contained in the uniform resource locator (URL) of the webpage; determining it from the time shown in a content-type webpage; determining it from the time at which the webpage was crawled; determining it from the time at which the webpage was first added to the index.

5. The method of claim 1, wherein the second evaluation value is greater than the first evaluation value.

6. The method of claim 1, wherein the first weighting coefficient and the second weighting coefficient have the same value or different values.

7. The method of any one of claims 1 to 6, wherein the first evaluation value is an evaluation value formed from data that includes external links.

8. A method for sorting search query results according to the webpage evaluation value of any one of claims 1 to 7, characterized by comprising the following steps: obtaining the query results of a query from the search engine server system; and sorting, by the search engine server system, the query results according to the first evaluation value of each webpage and the second evaluation value of the webpage with the earliest generation time.

9. The method of claim 8, further comprising: displaying, by the search engine server system, the number of times each webpage has been reprinted in the query results.

10. A search engine server system, characterized by comprising: a crawler system, configured to obtain webpages having identical or nearly identical content; and an index system, configured to determine the generation time and the first evaluation value of each webpage, and to determine, according to the first evaluation values of the webpages, a second evaluation value of the webpage with the earliest generation time, wherein the index system is further configured, when determining the second evaluation value, to set it to the product of a first weighting coefficient and the sum of the first evaluation values of the webpages having identical or nearly identical content to the webpage with the earliest generation time, plus the product of a second weighting coefficient and the first evaluation value of the webpage with the earliest generation time.

11. The search engine server system of claim 10, wherein the index system is further configured to determine from the digital fingerprints of the webpages whether the webpages have identical or nearly identical content.

12. The search engine server system of claim 10, wherein the index system comprises: a digital fingerprint generating unit, configured to take from each webpage the longest paragraph of the middle content that is neither the first nor the last paragraph, or the longest sentence of that paragraph that is neither its first nor its last sentence, and to generate a digital fingerprint from it; a comparing unit, configured to determine from the digital fingerprints whether the content of the webpages is the same; and an obtaining unit, configured to obtain the webpages having identical or nearly identical content after it has been determined from whether the digital fingerprints are the same whether the content of the webpages is the same.

13. The search engine server system of claim 10, wherein the index system is further configured to determine the generation time of a webpage from one or a combination of the following: the time contained in the uniform resource locator (URL) of the webpage; the time shown in a content-type webpage; the time at which the webpage was crawled; the time at which the webpage was first added to the index.

14. The search engine server system of any one of claims 10 to 13, wherein the index system is further configured to determine the first evaluation value of each webpage from an evaluation value formed from data that includes links from other webpages pointing to it.

15. The search engine server system of claim 10, wherein the index system is further configured to sort the query results according to the first evaluation value and the second evaluation value of each webpage.

16. The search engine server system of claim 15, wherein the index system is further configured to display the number of times each webpage has been reprinted in the query results.
TW098133414A 2009-10-01 2009-10-01 The method of determining and using the method of web page evaluation TWI497322B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW098133414A TWI497322B (en) 2009-10-01 2009-10-01 The method of determining and using the method of web page evaluation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW098133414A TWI497322B (en) 2009-10-01 2009-10-01 The method of determining and using the method of web page evaluation

Publications (2)

Publication Number Publication Date
TW201113726A TW201113726A (en) 2011-04-16
TWI497322B true TWI497322B (en) 2015-08-21

Family

ID=44909753

Family Applications (1)

Application Number Title Priority Date Filing Date
TW098133414A TWI497322B (en) 2009-10-01 2009-10-01 The method of determining and using the method of web page evaluation

Country Status (1)

Country Link
TW (1) TWI497322B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6560600B1 (en) * 2000-10-25 2003-05-06 Alta Vista Company Method and apparatus for ranking Web page search results
TW200620002A (en) * 2004-12-02 2006-06-16 Taiwan Semiconductor Mfg Co Ltd System and method for text searching using weighted keywords
US20070067304A1 (en) * 2005-09-21 2007-03-22 Stephen Ives Search using changes in prevalence of content items on the web
TW200719183A (en) * 2005-08-15 2007-05-16 Microsoft Corp Ranking functions using a biased click distance of a document on a network

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees