Summary of the invention
To overcome the above deficiencies in the prior art, the invention provides a method for rapidly and automatically evaluating online search engines. The method makes it possible to compare the result differences between search engines. It is suitable both for regular comparative evaluation of search engines and for frequent evaluation during search engine optimization, to check whether an optimized algorithm is successful.
To achieve the above object, the invention adopts the following technical solution:
An automated comparative evaluation method for multiple search engines that relies on a document library, characterized in that the method comprises the following steps:
A. selecting evaluation words;
B. crawling search results and saving them as documents;
C. extracting the body text of the documents;
D. calculating relevance;
E. merging the documents and ranking them by relevance;
F. calculating DCG;
G. ranking by DCG result and summarizing the evaluation result.
Preferably, the evaluation words comprise: page search keywords for web search, and film titles or actor names for video search.
Preferably, the crawling comprises two crawl passes;
The first pass comprises: generating the search result link of each search engine from the keyword, crawling it, and using a template to extract from each search engine the relevant information of each result and the link to each result's detail page, and saving them; the template is a regular expression containing the search conditions;
The second pass comprises: crawling the corresponding pages according to the detail-page links obtained in the first pass, and saving each of them in order as a document.
Preferably, the body text extraction method comprises: an HTML extraction method based on the DOM tree, and a longest-text-string extraction method;
The DOM-tree-based HTML extraction method comprises: converting the HTML text into a DOM tree, then analyzing the nodes of the DOM tree to extract the content related to the body text and remove irrelevant information from the page; the irrelevant information comprises page noise and HTML tags;
The longest-text-string extraction method comprises: finding the longest text string in the HTML page content, expanding forward and backward from it until a threshold is reached, then truncating and extracting to obtain the body content.
Preferably, the relevance calculation method comprises a word frequency ratio method, expressed as: relevance = proportion of the word frequency in this document × proportion of the word frequency in all crawled results.
Preferably, the ranking by relevance comprises: dividing the documents equally into several grades and setting a corresponding relevance score for each grade.
Preferably, the DCG is calculated by the following formula:
$$DCG_s = rel_1 + \sum_{i=2}^{s} \frac{rel_i}{\log_2 i}$$
where $s$ is the total number of documents, $i$ is the ordinal position of a document in the ranking, and $rel_i$ is the relevance score of the grade to which that document belongs.
Preferably, the results calculated in step F are ranked and analyzed to produce multiple outputs and generate a report; the outputs comprise: a ranking by average DCG score of the results calculated in step F, a ranking by total DCG score, and a ranking by the number of better and worse search results across all keywords.
Compared with the prior art, the beneficial effects of the invention are:
1) Automated: no manual participation is required, which saves a large amount of labor;
2) Fast: evaluation results are obtained in a short time;
3) Flexible: configurable modes are used at many points in the process, and the relevance calculation and other components can be adjusted as needed, so the method is highly flexible;
4) The whole method can be applied to multiple kinds of vertical search: not only web search, but also news search, video search, and so on.
Embodiment
The invention is described in further detail below with reference to the accompanying drawing.
Analysis of the search engines and surveys of how users use them indicate that users care mostly about two aspects of a search engine: accuracy and ranking. Accuracy ensures that the content shown in the search results is what the user wants. Ranking places the results closest to the user's need first, so that the user can find the desired content directly without scrolling or paging. The invention therefore takes these two aspects as the starting point for evaluating the results of each search engine.
The specific steps are as follows:
1) Select evaluation words
The quality of the chosen evaluation words directly determines how well the evaluation result matches the actual effect. So that the evaluation covers a larger share of searches, the invention by default chooses 3000 high-frequency search words as the evaluation sample; these words can be extracted from user search rankings. The scope and number of words can be changed according to actual conditions: for web search, page search keywords are chosen; for video search, frequently searched film titles, actors, and the like are chosen.
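The word selection in step 1 can be sketched as a frequency count over a query log. This is a minimal illustration only: the log format (one query string per entry) and the function name `select_eval_words` are assumptions, not part of the invention as described.

```python
from collections import Counter


def select_eval_words(query_log, k=3000):
    """Return the k most frequent non-empty queries from an iterable of strings."""
    counts = Counter(q.strip() for q in query_log if q.strip())
    # most_common sorts by descending count; ties keep first-seen order.
    return [word for word, _ in counts.most_common(k)]


# Hypothetical miniature log; a real run would use k=3000 as in the text.
log = ["weather", "news", "weather", "  ", "movie", "news", "weather"]
top_words = select_eval_words(log, k=2)
print(top_words)
```

The default of 3000 mirrors the default sample size stated above; both the source of the log and `k` are meant to be configurable.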
2) Crawl the search results of each search engine
Studies of user behavior show that most users look only at the first 2 pages of search results, roughly 40 entries, so by default the invention crawls the first 40 entries of the search results for analysis (the number is configurable). For each result, most search engines return a link to the source address and a summary. Because this is not the complete result, the invention performs a second crawl that fetches the complete result page from the source address, which is used to calculate the relevance between that page and the search word.
The two crawl passes proceed as follows. First, the search result link of each search engine is generated from the keyword and crawled; a regular-expression template then extracts from each search engine the relevant information of each result and the link to each result's detail page, and these are saved. The links are used for the second crawl.
The second crawl takes the detail-page links from the results of the first crawl, fetches the corresponding pages, and saves them in order for use in step 3.
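The two crawl passes can be sketched as follows. The engine name `engine_a`, the URL pattern, the result markup, and the `fetch` callable are all hypothetical placeholders; a real template would have to be written against each search engine's actual result page, and `fetch` would be an HTTP client of the implementer's choice.

```python
import re
from urllib.parse import quote_plus

# Hypothetical per-engine templates: a search URL pattern plus a regular
# expression whose named groups capture each result's link and title.
TEMPLATES = {
    "engine_a": {
        "search_url": "https://search.example.com/s?q={query}&n={count}",
        "result_re": re.compile(
            r'<a class="result" href="(?P<url>[^"]+)">(?P<title>[^<]*)</a>'
        ),
    },
}


def build_search_url(engine, keyword, count=40):
    """First pass, part 1: generate the search result link from the keyword."""
    template = TEMPLATES[engine]["search_url"]
    return template.format(query=quote_plus(keyword), count=count)


def extract_results(engine, html, limit=40):
    """First pass, part 2: apply the regex template to the result page."""
    pattern = TEMPLATES[engine]["result_re"]
    results = []
    for rank, match in enumerate(pattern.finditer(html), start=1):
        if rank > limit:
            break
        results.append(
            {"rank": rank, "url": match.group("url"), "title": match.group("title")}
        )
    return results


def second_pass(results, fetch):
    """Second pass: fetch every detail page, keeping the original order."""
    return [dict(r, document=fetch(r["url"])) for r in results]


# Offline demonstration with a canned result page and a fake fetcher.
sample_html = (
    '<a class="result" href="http://a.example/1">First</a><p>summary</p>'
    '<a class="result" href="http://a.example/2">Second</a>'
)
search_url = build_search_url("engine_a", "hot news")
first_pass = extract_results("engine_a", sample_html)
saved = second_pass(first_pass, fetch=lambda url: "<html>" + url + "</html>")
```

Keeping the template per engine in data rather than in code matches the configurable approach the text describes: supporting another engine means adding one dictionary entry.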
3) Extract body text
The result pages crawled from the source addresses mostly contain noise such as advertisements. Before the relevance of a result is calculated, the body content of the result page is therefore extracted so that this noise does not affect the calculation.
For body extraction, common methods such as the DOM-tree-based HTML extraction method or the longest-text-string extraction method can be used to obtain the body text of a result page, from which the relevance between the article and the search keyword is calculated.
The DOM-tree-based HTML extraction method first converts the HTML text into a DOM tree, then analyzes the nodes of the tree to extract the content related to the body text and remove irrelevant information such as page noise and HTML tags. The key point of this method is how to repair the DOM tree correctly when it is incomplete.
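A minimal sketch of DOM-tree-based extraction, using only the Python standard library, is given below. The set of noise tags and the repair rule (an end tag closes every still-open element up to its match, and a stray end tag is ignored) are simplifying assumptions made for illustration; the text does not specify how the tree is repaired or which nodes count as noise.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "area", "base",
             "col", "embed", "source", "track", "wbr"}
# Assumed noise containers; a real system would tune this list.
NOISE_TAGS = {"script", "style", "nav", "aside", "footer", "header",
              "form", "iframe"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child Nodes and text strings, in document order


class TreeBuilder(HTMLParser):
    """Builds a simple tree and tolerates unclosed or stray tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Repair: walk up to the nearest open element with this tag and
        # close everything below it; ignore an end tag that matches nothing.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def extract_text(node):
    """Collect text in document order, skipping noise subtrees."""
    if isinstance(node, str):
        return [node]
    if node.tag in NOISE_TAGS:
        return []
    parts = []
    for child in node.children:
        parts.extend(extract_text(child))
    return parts


def dom_body_text(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return " ".join(extract_text(builder.root))


page = ("<html><body><nav>Home | About</nav><div><p>Main story text"
        "<p>Second paragraph</div><script>var x=1;</script>"
        "<footer>Copyright</footer></body></html>")
body = dom_body_text(page)
```

The sample page deliberately leaves both `<p>` elements unclosed to exercise the repair rule.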
The longest-text-string extraction method is suitable for pages whose body is a long article. It first finds the longest text string in the HTML content, then expands forward and backward from it until a threshold is reached, and then truncates and extracts to obtain the body content.
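The longest-text-string method can be sketched as below. The text does not define the threshold, so this sketch assumes one possible reading: expansion continues over neighboring text blocks and stops at the first block shorter than `min_len` characters. The block splitting by tags and the names `split_text_blocks` and `longest_string_body` are likewise illustrative assumptions.

```python
import re


def split_text_blocks(html):
    """Split a page into non-empty text runs between tags."""
    # Drop script and style bodies first so code is never taken as text.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    blocks = [b.strip() for b in re.split(r"(?s)<[^>]+>", html)]
    return [b for b in blocks if b]


def longest_string_body(html, min_len=20):
    blocks = split_text_blocks(html)
    if not blocks:
        return ""
    # Seed: the longest text run on the page.
    center = max(range(len(blocks)), key=lambda i: len(blocks[i]))
    start = end = center
    # Expand backward and forward until a block falls below the threshold.
    while start > 0 and len(blocks[start - 1]) >= min_len:
        start -= 1
    while end < len(blocks) - 1 and len(blocks[end + 1]) >= min_len:
        end += 1
    return "\n".join(blocks[start:end + 1])


page = (
    "<div>Home</div><div>Ad: buy now</div>"
    "<p>The city council approved the new budget on Tuesday after a long debate.</p>"
    "<p>Officials said the plan takes effect next month.</p>"
    "<div>Share</div>"
    '<script>var tracking = "a very long string that must not be picked as body text";</script>'
)
body = longest_string_body(page)
```

Short navigation and widget labels fall below the threshold and are cut off, while adjacent paragraphs of the article are kept together.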
4) Calculate relevance
Relevance calculation is the key link in the process of the invention; steps 2 and 3 both prepare for it. Only when the relevance between each search result and the search keyword is calculated correctly can the correctness of the final evaluation result be guaranteed.
The choice of relevance calculation rules can also vary with the type of vertical search: web search places more weight on content match; news search must consider both content match and time; video search pays more attention to titles, annotations, and the like.
In the invention, the relevance algorithm can be adjusted flexibly. A small set of manually evaluated results can be taken as a sample to adjust the weights needed for the relevance calculation dynamically by machine learning, or an established relevance algorithm can be adopted directly.
For example, in the news search test, a word frequency ratio method was used to calculate the relevance of the plain text. The algorithm is: relevance = proportion of the word frequency in this document × proportion of the word frequency in all crawled results, that is:
[equation missing in source]
where
[equation missing in source]
The cube root is taken in order to balance the weight of that term against P(D);
[equation missing in source]
In this formula, n is the number of words after word segmentation, N(i) is the number of times word i occurs, L(i) is the length of word i, and L(T) is the length of the text;
[equation missing in source]
In this formula, T(i) is the number of times word i occurs in all search results of all search engines;
The time relevance uses a reciprocal curve:
[equation missing in source]
In this formula, T(n) is the current time, T(t) is the publication time, and the numerator W is a weight used to balance P(M) against P(T);
The final relevance is calculated as the harmonic mean of the two:
$$\frac{2\,P(M)\,P(T)}{P(M) + P(T)}$$
This increases the weight of the lower of the two relevance values, so that the result is closer to the actual situation.
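Because the equations themselves are not reproduced in this text, the following is only a sketch under stated assumptions, not the formulas of the invention. It assumes P(D) = Σ N(i)·L(i) / L(T), an all-results proportion of Σ N(i) / Σ T(i) that is the term placed under the cube root, P(T) = W / (T(n) − T(t)) with the age measured in hours, and substring counting in place of real word segmentation. Only the harmonic mean follows directly from the description.

```python
def doc_proportion(words, text):
    """Assumed P(D): share of the text length covered by the query words."""
    if not text:
        return 0.0
    return sum(text.count(w) * len(w) for w in words) / len(text)


def corpus_proportion(words, text, corpus_counts):
    """Assumed form: occurrences in this document / occurrences in all results."""
    in_doc = sum(text.count(w) for w in words)
    in_all = sum(corpus_counts.get(w, 0) for w in words)
    return in_doc / in_all if in_all else 0.0


def text_relevance(words, text, corpus_counts):
    """Assumed P(M): P(D) times the cube root of the all-results proportion."""
    return doc_proportion(words, text) * corpus_proportion(
        words, text, corpus_counts
    ) ** (1 / 3)


def time_relevance(now, published, weight=1.0):
    """Assumed P(T): reciprocal curve W / age, with age clamped to 1 hour."""
    age = max(now - published, 1.0)
    return weight / age


def harmonic_mean(a, b):
    return 2 * a * b / (a + b) if a + b > 0 else 0.0


# Illustrative numbers only.
words = ["rain"]
text = "rain rain go"
corpus_counts = {"rain": 16}  # occurrences across all crawled results
p_m = text_relevance(words, text, corpus_counts)     # (8/12) * (2/16)^(1/3)
p_t = time_relevance(now=10.0, published=8.0)        # 1 / 2
final = harmonic_mean(p_m, p_t)
```

The harmonic mean is dominated by the smaller of its two inputs, which is why a stale article with a strong text match, or a fresh article with a weak match, both end up with a low final score.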
5) Merge and rank by relevance
Step 4 calculates the relevance of every result document. Here, all result documents returned for a single search keyword by all search engines are merged and sorted by relevance, and the sorted results are divided equally into three classes: good, medium, and poor (more classes can be used as required; equal division is used for ease of automated operation). Each class is given a corresponding relevance score (for example, 3 to 1 points for three classes; with N classes, the scores run from N to 1), which is supplied to the DCG formula to calculate the final DCG score.
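The merge-and-grade step can be sketched as follows. The input shape (a mapping from result URL to relevance, so that a page returned by several engines is graded once) and the name `grade_documents` are assumptions made for illustration.

```python
def grade_documents(relevance_by_url, scores=(3, 2, 1)):
    """Sort merged documents by relevance and split them equally into grades.

    Returns a mapping from URL to the relevance score of its grade.
    """
    ranked = sorted(relevance_by_url, key=relevance_by_url.get, reverse=True)
    n, k = len(ranked), len(scores)
    # Position pos of n falls into grade floor(pos * k / n), which is 0..k-1.
    return {url: scores[pos * k // n] for pos, url in enumerate(ranked)}


# Illustrative relevance values for one keyword, merged across engines.
merged = {"u1": 0.9, "u2": 0.8, "u3": 0.5, "u4": 0.4, "u5": 0.2, "u6": 0.1}
rel_by_url = grade_documents(merged)
```

Passing a longer `scores` tuple, such as `(5, 4, 3, 2, 1)`, gives the multi-class variant mentioned in the text.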
6) Calculate DCG
DCG is a method for evaluating ranking quality: when highly relevant documents are placed at the front of the result page the score is high, and when documents with low relevance are placed at the front the score is low. The DCG formula for s documents is:
$$DCG_s = rel_1 + \sum_{i=2}^{s} \frac{rel_i}{\log_2 i}$$
Step 5 sorts the search results of a single search keyword and assigns each document a corresponding relevance score, which is $rel_i$ in the formula. All results for the keyword are then grouped by search engine. Within each search engine group, the total DCG score of the keyword in that search engine is calculated with the formula from the rank i of each result in that search engine. Doing this for all groups gives the DCG score of the keyword in each search engine.
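The per-engine DCG calculation can be sketched as below. It assumes the common form in which the first result is undiscounted and result i ≥ 2 is divided by log2(i), which is consistent with the 1/log2(i) discount described in the text; the tuple layout `(engine, rank, url)` and the engine names are hypothetical.

```python
import math
from collections import defaultdict


def dcg(rels):
    """DCG of a list of relevance scores given in rank order (rank 1 first)."""
    total = 0.0
    for i, rel in enumerate(rels, start=1):
        total += rel if i == 1 else rel / math.log2(i)
    return total


def dcg_by_engine(results, rel_by_url):
    """Group one keyword's results by engine and score each group.

    results: iterable of (engine, rank, url) tuples.
    rel_by_url: relevance score assigned to each document in step 5.
    """
    groups = defaultdict(list)
    for engine, rank, url in results:
        groups[engine].append((rank, rel_by_url[url]))
    return {
        engine: dcg([rel for _, rel in sorted(items)])
        for engine, items in groups.items()
    }


# Illustrative data: both engines return the same pages in opposite order.
rel_by_url = {"u1": 3, "u2": 2, "u3": 1}
results = [
    ("engine_a", 1, "u1"), ("engine_a", 2, "u2"), ("engine_a", 3, "u3"),
    ("engine_b", 1, "u3"), ("engine_b", 2, "u2"), ("engine_b", 3, "u1"),
]
scores = dcg_by_engine(results, rel_by_url)
```

Because the relevance scores come from the merged ranking of step 5, the two engines are scored against the same yardstick and their DCG values are directly comparable.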
The following cases arise in the DCG calculation:
1. The results of search engine A are generally better than those of search engine B, but A's ranking is not as good as B's. Because $rel_A$ is generally higher than $rel_B$, the DCG of A is higher than that of B, which is logical.
2. The relevance of the results of search engine A and search engine B is similar, but A's ranking is better. B's high-scoring results are ranked lower, so their $rel_B$ values are dragged down by the ranking discount $1/\log_2 i$, and the overall DCG of B is lower than that of A, which is logical.
3. The results of search engine A are better than those of search engine B and are also ranked better, so the DCG of A is certainly higher than that of B, which is logical.
These three cases show that, in the implementation of the invention, the DCG result can serve as a standard for evaluating the quality of search engine results.
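As a worked illustration of case 2, assume the discount form in which the first result is undiscounted and result i ≥ 2 is divided by log2(i), and suppose both engines return the same three documents with scores 3, 2, and 1 (illustrative numbers, not taken from the test). Engine A ranks them 3, 2, 1 and engine B ranks them 1, 2, 3:

```latex
\begin{aligned}
DCG_A &= 3 + \frac{2}{\log_2 2} + \frac{1}{\log_2 3} \approx 3 + 2 + 0.631 = 5.631 \\
DCG_B &= 1 + \frac{2}{\log_2 2} + \frac{3}{\log_2 3} \approx 1 + 2 + 1.893 = 4.893
\end{aligned}
```

The two result sets are identical, yet B scores lower because its best document sits in third place and is divided by log2(3) ≈ 1.585.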
7) Rank by DCG result and summarize the evaluation result
The results obtained in step 6 are ranked and analyzed in detail to produce multiple outputs, such as a ranking by average DCG score over all results, a ranking by total DCG score, and a ranking by the number of better and worse search results across all keywords. A report is generated so that the comparison can be viewed intuitively.
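The summary step can be sketched as a small aggregation over the per-keyword DCG scores. The table layout `{keyword: {engine: dcg}}` is an assumption, and the count of better results is interpreted here as the number of keywords on which an engine achieved the highest DCG; the text does not define that count precisely.

```python
def summarize(dcg_table):
    """Aggregate per-keyword DCG scores into per-engine totals and rankings."""
    engines = sorted({e for scores in dcg_table.values() for e in scores})
    n_keywords = len(dcg_table) or 1  # avoid division by zero on empty input
    total = {
        e: sum(scores.get(e, 0.0) for scores in dcg_table.values())
        for e in engines
    }
    average = {e: total[e] / n_keywords for e in engines}

    # Assumed reading of "number of better results": keywords won per engine.
    wins = dict.fromkeys(engines, 0)
    for scores in dcg_table.values():
        if not scores:
            continue
        best = max(scores.values())
        for engine, score in scores.items():
            if score == best:
                wins[engine] += 1

    def ranking(table):
        return sorted(table, key=table.get, reverse=True)

    return {
        "total": total,
        "average": average,
        "wins": wins,
        "rank_by_total": ranking(total),
        "rank_by_average": ranking(average),
        "rank_by_wins": ranking(wins),
    }


# Illustrative scores for three keywords and two hypothetical engines.
table = {
    "k1": {"A": 5.0, "B": 4.0},
    "k2": {"A": 3.0, "B": 3.5},
    "k3": {"A": 6.0, "B": 2.0},
}
report = summarize(table)
```

The returned dictionary holds the raw numbers alongside the three rankings, so a report generator can render either a table or a chart from it.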
The method of the invention yields evaluation results quickly and easily, and avoids the large time and labor cost of manual evaluation. A test was run on news search, one kind of vertical search: 3000 hot news words were chosen and four search engines were evaluated, namely Baidu, Sogou, Zhongsou, and Yahoo (Google was not included as an evaluation target because of problems such as frequent blocking). 40 search results were taken from each search engine, and the evaluation took approximately 2 hours (the bottleneck being web page crawling). Comparison with the results of manual evaluation showed that the evaluation results of the invention differ from those of manual evaluation by no more than 5%.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the invention and not to limit it. Although the invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific embodiments of the invention may still be modified or equivalently substituted, and that any modification or equivalent substitution that does not depart from the spirit and scope of the invention shall fall within the scope of the claims of the invention.