CN103544307B

CN103544307B - An automated comparative evaluation method for multiple search engines, independent of a document library

Info

Publication number: CN103544307B
Application number: CN201310538069.5A
Authority: CN
Inventors: 张鹏飞; 赵毅强
Original assignee: Beijing Zhongsou Cloud Business Network Technology Co Ltd
Current assignee: Beijing Zhongsou Cloud Business Network Technology Co Ltd
Priority date: 2013-11-04
Filing date: 2013-11-04
Publication date: 2017-08-08
Anticipated expiration: 2033-11-04
Also published as: CN103544307A

Abstract

The present invention provides an automated comparative evaluation method for multiple search engines that does not depend on a document library, characterized in that the method comprises the following steps: A. selecting evaluation words; B. crawling search results and saving them as documents; C. extracting the body text of the documents; D. calculating relevance; E. merging the documents and ranking them by relevance; F. calculating DCG; G. ranking by the DCG results and summarizing the evaluation results. The invention achieves the following effects: it is automated, requiring no human participation and saving a large amount of labor; it is fast, producing evaluation results in a short time; it is flexible, since many stages of the process are configurable and the relevance calculation can be adjusted freely; and it applies to a variety of vertical searches, not only plain web search but also news search, video search, and so on.

Description

An automated comparative evaluation method for multiple search engines, independent of a document library

Technical field

The invention belongs to the field of search engines, and specifically relates to an automated comparative evaluation method for multiple search engines that does not depend on a document library.

Background

In today's network environment, search engines have become an essential tool for Internet users, and many search engines exist on the Internet. There are two main approaches to comparing the results of different search engines. One is manual: a number of keywords are selected and searched on each search engine, every search result on the returned result pages is scored by hand, and the scores are compared to judge the relative quality of the search engines. The other relies on a document library: each search engine's algorithm is evaluated by precision and recall.

Manually evaluating search engine results consumes a great deal of human resources and time. If a search engine is under active optimization and must be evaluated frequently, this places a heavy burden on manual evaluation and makes it impractical.

The document-library method can only be applied to offline search engines. Because the document libraries of different search engines differ, it cannot be used to evaluate search engines running online.

Summary of the invention

To overcome the above deficiencies of the prior art, the present invention provides a method that evaluates online search engines automatically and quickly. The method can compare differences in results among search engines, and is suited both to periodic comparative evaluation among search engines and to frequent evaluation during search engine optimization, to check whether an optimized algorithm has succeeded.

To achieve the above object, the present invention adopts the following technical solution:

An automated comparative evaluation method for multiple search engines, independent of a document library, characterized in that the method comprises the following steps:

A. selecting evaluation words;

B. crawling search results and saving them as documents;

C. extracting the body text of the documents;

D. calculating relevance;

E. merging the documents and ranking them by relevance;

F. calculating DCG;

G. ranking by the DCG results and summarizing the evaluation results.

Preferably, the evaluation words include: page search keywords in web search, and film titles or actor names in video search.

Preferably, the crawling comprises two crawl passes;

The first crawl comprises: generating search result links for each search engine from the keyword, performing the first crawl, using a template to extract from each search engine the relevant information of each result and the link to each result's detail page, and saving them; the template is a regular expression containing the search conditions;

The second crawl comprises: crawling the corresponding pages using the detail-page links obtained in the first crawl, and saving each page as a document, in order.

Preferably, the body-text extraction methods include: a DOM-tree-based HTML extraction method and a longest-text-string body extraction method;

The DOM-tree-based HTML extraction method comprises: converting the HTML text into a DOM tree, then analyzing the nodes of the DOM tree to extract the content related to the body text and remove irrelevant information from the page; the irrelevant information includes page noise and HTML tags;

The longest-text-string body extraction method comprises: finding the longest text string in the HTML page content, expanding forward and backward from it until a threshold is reached, then truncating and extracting to obtain the body content.

Preferably, the relevance calculation methods include a word-frequency ratio method, expressed as: relevance = the proportion of the word frequency within the document × the proportion of the word frequency within all crawled results.

Preferably, the ranking by relevance comprises: dividing the documents evenly into several grades, and setting a corresponding relevance coefficient score for each grade.

Preferably, the DCG calculation is expressed by the following formula:

In the formula, s is the total number of documents, i is the ordinal position of the document, and rel_i is the relevance coefficient score of the grade to which that document belongs.

Preferably, the results calculated in step F are ranked and analyzed to produce several kinds of output, and a report is generated; the output includes: a ranking by average DCG score of the results calculated in step F, a ranking by total DCG score, and a ranking by the number of good and poor search results across all keywords.

Compared with the prior art, the beneficial effects of the present invention are:

1) it is automated, requiring no human participation and saving a large amount of labor;

2) it is fast, producing evaluation results in a short time;

3) it is flexible: many stages of the process are configurable, and the relevance calculation and other components can be adjusted freely;

4) the overall process applies to a variety of vertical searches, not only plain web search but also news search, video search, and so on.

Brief description of the drawings

Fig. 1 is a flow chart of the evaluation process of the present invention.

Detailed description

The present invention is described in further detail below with reference to the accompanying drawing.

Analysis of the search engines and surveys of how users use them confirm that users care mainly about two aspects of a search engine: accuracy and ranking. Accuracy ensures that the content shown in the search results is what the user wants. Ranking places the results closest to the user's need first, so that the user finds the desired content directly without scrolling or paging. The present invention therefore takes these two aspects as its starting point for evaluating the results of each search engine.

The specific steps are as follows:

1) Select evaluation words

The quality of the evaluation-word selection directly determines how well the evaluation results match actual performance. So that the evaluation covers as many searches as possible, the present invention by default chooses 3000 high-frequency words from search engine results as the evaluation sample; these words can be extracted from users' search rankings. The scope and number of words can be changed according to the actual situation: for web search, page search keywords are chosen; for video search, frequently searched film titles, actors, and the like are chosen.
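The selection of high-frequency words can be sketched as a simple frequency count. This is a minimal illustration, assuming the search ranking is available as a plain list of query strings; the function name `select_evaluation_words` and the `query_log` input are hypothetical and not part of the original text.

```python
from collections import Counter


def select_evaluation_words(query_log, limit=3000):
    """Return up to `limit` of the most frequent queries, most frequent first.

    `query_log` is an assumed input: an iterable of raw query strings.
    """
    # Ignore blank entries and surrounding whitespace before counting.
    counts = Counter(q.strip() for q in query_log if q.strip())
    return [word for word, _ in counts.most_common(limit)]
```

The `limit` parameter mirrors the default of 3000 words and can be lowered or raised, matching the statement that the number of words is adjustable.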

2) Crawl the search results of each search engine

Research on user behavior shows that most users pay attention only to the first 2 pages of search results, roughly 40 results. The present invention therefore by default crawls the first 40 results for analysis (the number of results is configurable as needed). Most of what a search engine returns is a link to the source address plus a summary. Because this is not the complete result, the present invention performs a second crawl, going to the source address to fetch the complete result page, which is used to calculate the relevance between the page and the search word.

The two-pass crawl proceeds as follows. First, search result links for each search engine are generated from the keyword and the first crawl is performed; a regular-expression template extracts from each search engine the relevant information of each result and the link to each result's detail page, and these are saved. The links are then used in the second crawl.

The second crawl takes the detail-page links from the results of the first crawl, fetches the corresponding pages, and saves them in order for use in step 3.
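The two crawl passes can be sketched as follows. This is an illustrative outline only: the regular expression `RESULT_TEMPLATE` is a made-up template for an imaginary result-page layout (each real engine would need its own), and the page fetcher is passed in as the hypothetical `fetch` callable so that the sketch runs without network access.

```python
import re

# Hypothetical template for one engine's result page (an assumption, not a real
# engine's markup): captures the detail-page link and the result title.
RESULT_TEMPLATE = re.compile(
    r'<h3><a href="(?P<url>[^"]+)">(?P<title>.*?)</a></h3>', re.S
)


def first_crawl(result_page_html, template=RESULT_TEMPLATE, limit=40):
    """First pass: extract rank, link, and title of each result on a result page."""
    results = []
    for rank, match in enumerate(template.finditer(result_page_html), start=1):
        if rank > limit:
            break
        results.append(
            {"rank": rank, "url": match.group("url"), "title": match.group("title")}
        )
    return results


def second_crawl(results, fetch):
    """Second pass: fetch each detail page in rank order.

    `fetch` is an assumed callable mapping a URL to the page's HTML.
    """
    return [dict(result, html=fetch(result["url"])) for result in results]
```

Keeping the template as a parameter reflects the configurable, per-engine templates described above, and `limit` corresponds to the default of 40 results.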

3) Extract the body text

Result pages fetched from the source address mostly contain noise such as advertisements. Before calculating the relevance of a result, the body content of the result page is therefore extracted, so that this noise does not affect the calculation.

For body extraction, common methods such as the DOM-tree-based HTML extraction method or the longest-text-string body extraction method can be used to obtain the body text of the result page, which is then used to calculate the relevance between that text and the search keyword.

The DOM-tree-based HTML extraction method first converts the HTML text into a DOM tree, then analyzes the nodes of the DOM tree to extract the content related to the body text and remove irrelevant information such as page noise and HTML tags. The key difficulty of this method is correctly repairing the DOM tree when it is incomplete.
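A simplified stand-in for the tag-based extraction is sketched below. It does not build a full DOM tree or repair broken markup; instead it uses the event-driven `html.parser.HTMLParser` from the Python standard library and tracks nesting inside tags treated as noise. The set `NOISE_TAGS` is an assumption chosen for illustration.

```python
from html.parser import HTMLParser

# Assumed set of tags whose content is treated as page noise.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class BodyTextExtractor(HTMLParser):
    """Collects text that is not nested inside any noise tag."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # how many noise tags are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def extract_body_text(html):
    """Return the page text with tags and noise-tag content removed."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

A production version would more likely parse into a real tree and score nodes, but the sketch shows the two effects named above: HTML tags are dropped and noisy regions are skipped.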

The longest-text-string extraction method suits pages whose body is long text. It first finds the longest text string in the HTML content, expands forward and backward from it until a threshold is reached, then truncates and extracts to obtain the body content.
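One possible reading of this method is sketched below. The text does not define the threshold, so the sketch assumes it is a minimum line length: expansion stops at the first neighboring line shorter than `min_len`. Tag stripping with a regular expression and line-based segmentation are likewise simplifying assumptions.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")


def extract_longest_text_block(html, min_len=20):
    """Return the block of text lines surrounding the longest text line.

    Assumption: the threshold is a minimum line length (`min_len`).
    """
    # Strip tags line by line so each entry holds only visible text.
    lines = [TAG_RE.sub("", line).strip() for line in html.splitlines()]
    if not lines:
        return ""
    # Locate the longest text line, then grow the block in both directions.
    center = max(range(len(lines)), key=lambda i: len(lines[i]))
    start = end = center
    while start > 0 and len(lines[start - 1]) >= min_len:
        start -= 1
    while end < len(lines) - 1 and len(lines[end + 1]) >= min_len:
        end += 1
    return "\n".join(lines[start:end + 1])
```

Short lines such as menus or advertisement labels act as natural boundaries, which is why this approach works best on pages whose body is long running text.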

4) Calculate relevance

Relevance calculation is the key link in the process of the present invention; steps 2 and 3 exist to prepare for it. Only when the relevance between each search result and the search keyword is calculated correctly can the correctness of the final evaluation result be guaranteed.

The relevance calculation rules can also vary with the vertical search: web search places more weight on content match; news search must consider both content match and time; video search pays more attention to titles, annotations, and the like.

In the present invention the relevance algorithm can be adjusted flexibly. A small set of manually evaluated results can serve as samples for dynamically adjusting the weights needed in the relevance calculation by machine learning, or an established relevance algorithm can be used directly.

For example, in the news search test, the word-frequency ratio method was used to calculate the relevance of plain text. Specifically, relevance = the proportion of the word frequency within the document × the proportion of the word frequency within all crawled results, i.e.:

where

The cube root is taken to balance the weight of this term against P(D);

In the formula, n is the number of words after word segmentation, N(i) is the number of occurrences of word i, L(i) is the length of word i, and L(T) is the length of the body text;

In the formula, T(i) is the number of occurrences of word i in all search results from all search engines;

Time relevance follows a reciprocal curve:

In the formula, T(n) is the current time, T(t) is the publication time, and the numerator W is a weight value used to balance the weight between P(M) and P(T);

The final relevance is calculated as the harmonic mean of the two:

Because the harmonic mean is dominated by the smaller of its two inputs, this raises the weight of whichever relevance is lower and brings the result closer to actual conditions.
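The overall shape of this calculation can be sketched as follows, under loud assumptions. The exact formulas are not reproduced here, so `document_share` and `corpus_share` are simplified guesses built only from the quantities named above (occurrence counts, word lengths, text length, and occurrences across all results), the cube-root balancing is omitted, and the `+ 1` in `time_relevance` is added purely to avoid division by zero. Only the harmonic mean is a standard, fully determined formula.

```python
def document_share(text, terms):
    """Assumed form: share of the text's characters covered by query-term hits."""
    if not text:
        return 0.0
    covered = sum(text.count(term) * len(term) for term in terms)
    return min(1.0, covered / len(text))


def corpus_share(text, terms, all_texts):
    """Assumed form: this document's share of all term occurrences crawled."""
    in_doc = sum(text.count(term) for term in terms)
    in_all = sum(doc.count(term) for doc in all_texts for term in terms)
    return in_doc / in_all if in_all else 0.0


def content_relevance(text, terms, all_texts):
    """Product of the two proportions, as in the word-frequency ratio method."""
    return document_share(text, terms) * corpus_share(text, terms, all_texts)


def time_relevance(now, published, weight=1.0):
    """Reciprocal curve with numerator `weight`; the +1 is an assumption."""
    return weight / (now - published + 1)


def harmonic_mean(a, b):
    """Standard harmonic mean of two non-negative values."""
    return 2 * a * b / (a + b) if (a + b) else 0.0
```

For inputs 0.2 and 0.8 the harmonic mean is about 0.32, well below the arithmetic mean of 0.5, which illustrates how the lower of the two relevance values carries more weight in the combined score.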

5) Merge and rank by relevance

Step 4 calculates a relevance for each result document. Here, all result documents returned by all search engines for a single search keyword are merged and sorted by relevance, and the results are then divided evenly into three classes: good, medium, and poor (more classes can be used as needs require; this step is automated). The classes are assigned relevance coefficient scores of 3 down to 1 (for N classes, the scores run from N down to 1), which are supplied to the DCG formula to calculate the final DCG score.
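The merge-and-grade step can be sketched as below. The input format, a list of `(doc_id, relevance)` pairs pooled from all engines for one keyword, and the function name `assign_grade_scores` are assumptions for illustration; how to split a pool that does not divide evenly is also a choice made here, not one stated in the text.

```python
def assign_grade_scores(scored_docs, num_classes=3):
    """Sort pooled documents by relevance and map each to a grade score.

    scored_docs: list of (doc_id, relevance) pairs from all engines.
    Returns {doc_id: score}, where the best class scores `num_classes`
    and the worst class scores 1.
    """
    ranked = sorted(scored_docs, key=lambda doc: doc[1], reverse=True)
    total = len(ranked)
    grades = {}
    for position, (doc_id, _) in enumerate(ranked):
        # Integer arithmetic splits the ranked list into near-equal classes.
        class_index = position * num_classes // total  # 0 is the best class
        grades[doc_id] = num_classes - class_index
    return grades
```

With six documents and three classes, the top two receive 3, the middle two receive 2, and the bottom two receive 1.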

6) Calculate DCG

DCG is an evaluation measure of ranking quality: when highly relevant documents are placed near the top of the result page the score is high, and when low-relevance documents are placed near the top the score is low. The DCG formula for s documents is:
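Assuming the classic definition of DCG, which is consistent with the 1/log₂ i rank discount referred to in the discussion of the cases further on, the formula takes the form:

```latex
\mathrm{DCG}_s = \mathit{rel}_1 + \sum_{i=2}^{s} \frac{\mathit{rel}_i}{\log_2 i}
```

Here s is the number of documents, i is a document's rank position, and rel_i is the relevance coefficient score assigned to the document at rank i.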

Step 5 has sorted the search results for a single search keyword and assigned each document its relevance coefficient score, which is rel_i in the formula. All results for the keyword are then grouped by search engine. Within a single engine's group, the formula is applied using each result's rank i in that search engine to calculate the keyword's total DCG score in that engine. Repeating this for every group yields the keyword's DCG score in each search engine.
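The per-engine calculation can be sketched as follows. The sketch assumes the classic DCG form, in which the first result is undiscounted and the result at rank i ≥ 2 is divided by log₂ i; the input format of `(engine, rank, grade_score)` tuples and both function names are hypothetical.

```python
import math


def dcg(rel_scores):
    """DCG of grade scores listed in the engine's own rank order (rank 1 first).

    Assumes the classic form: rel_1 + sum over i >= 2 of rel_i / log2(i).
    """
    total = 0.0
    for i, rel in enumerate(rel_scores, start=1):
        total += rel if i == 1 else rel / math.log2(i)
    return total


def dcg_by_engine(results):
    """Group one keyword's results by engine and compute each engine's DCG.

    results: list of (engine, rank, grade_score) tuples.
    """
    grouped = {}
    for engine, rank, score in results:
        grouped.setdefault(engine, []).append((rank, score))
    return {
        engine: dcg([score for _, score in sorted(items)])
        for engine, items in grouped.items()
    }
```

The same three documents score higher when ordered best-first, `dcg([3, 2, 1])`, than worst-first, `dcg([1, 2, 3])`, which is the ordering sensitivity the measure is meant to capture.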

The following cases arise when calculating DCG:

1. Search engine A's results are generally better than search engine B's, but its ranking is worse than B's. Because rel_A is generally higher than rel_B, A's DCG is higher than B's, which is logical.

2. The results of search engine A and search engine B have similar relevance, but A's ranking is better. In this case B's high-scoring rel_B values are ranked lower and are pulled down by the rank discount 1/log₂ i, so B's overall DCG is lower than A's, which is logical.

3. Search engine A's results are better than search engine B's, and its ranking is also better than B's. A's DCG is then certainly higher than B's, which is logical.

All three cases were verified during implementation of the present invention, so the DCG result can serve as a standard for evaluating the quality of search engine results.

7) Rank by DCG results and summarize the evaluation results

Ranking and analyzing the results of step 6 in detail yields several kinds of output, such as a ranking by average DCG score over all results, a ranking by total DCG score, and a ranking by the number of good and poor search results across all keywords. A report is generated for intuitive comparison.
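A report of this kind can be sketched as a small aggregation. The input format `{keyword: {engine: dcg_score}}` and the function name `summarize` are assumptions, every engine is assumed to have a score for every keyword, and the count of keywords on which an engine achieved the best DCG is used here as one possible interpretation of the good-and-poor count.

```python
def summarize(dcg_table):
    """Build per-engine summary rows, sorted by total DCG (highest first).

    dcg_table: {keyword: {engine: dcg_score}}; each engine is assumed to
    appear under every keyword.
    """
    totals, wins = {}, {}
    for scores in dcg_table.values():
        best = max(scores.values())
        for engine, score in scores.items():
            totals[engine] = totals.get(engine, 0.0) + score
            # Count the keywords on which this engine achieved the top score.
            wins[engine] = wins.get(engine, 0) + (1 if score == best else 0)
    keyword_count = len(dcg_table)
    rows = [
        {
            "engine": engine,
            "total_dcg": totals[engine],
            "average_dcg": totals[engine] / keyword_count,
            "best_count": wins[engine],
        }
        for engine in totals
    ]
    return sorted(rows, key=lambda row: row["total_dcg"], reverse=True)
```

Sorting the same rows by `average_dcg` or `best_count` gives the other rankings mentioned above.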

The method of the present invention yields evaluation results quickly and conveniently, entirely avoiding the large time and labor costs of manual evaluation. In a test on news search, one of the vertical searches, 3000 hot news words were chosen and four search engines were evaluated: Baidu, Sogou, Zhongsou, and Yahoo (Google was not included as an evaluation target because of problems such as frequent blocking). With 40 search results taken from each search engine, the evaluation took about 2 hours (the bottleneck being web page crawling). Comparing the results with those of manual evaluation showed that the evaluation results of the invention differed from the manual results by no more than 5%.

Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific embodiments of the invention may still be modified or equivalently substituted, and any modification or equivalent substitution that does not depart from the spirit and scope of the invention shall fall within the scope of the claims of the invention.

Claims

1. An automated comparative evaluation method for multiple search engines, independent of a document library, characterized in that the method comprises the following steps:

A. selecting evaluation words;

B. crawling search results and saving them as documents;

C. extracting the body text of the documents;

D. calculating relevance;

E. merging the documents and ranking them by relevance;

F. calculating DCG;

G. ranking by the DCG results and summarizing the evaluation results;

the evaluation words are 3000 high-frequency words selected from search engine results;

the relevance calculation methods include a word-frequency ratio method, expressed as: relevance = the proportion of the word frequency within the document × the proportion of the word frequency within all crawled results;

the body-text extraction methods include: a DOM-tree-based HTML extraction method and a longest-text-string body extraction method;

the DOM-tree-based HTML extraction method comprises: converting the HTML text into a DOM tree, then analyzing the nodes of the DOM tree to extract the content related to the body text and remove irrelevant information from the page; the irrelevant information includes page noise and HTML tags;

the longest-text-string body extraction method comprises: finding the longest text string in the HTML page content, expanding forward and backward from it until a threshold is reached, then truncating and extracting to obtain the body content.

2. The evaluation method of claim 1, characterized in that the evaluation words include: page search keywords in web search, and film titles or actor names in video search.

3. The evaluation method of claim 1, characterized in that the crawling comprises two crawl passes;

the first crawl comprises: generating search result links for each search engine from the keyword, performing the first crawl, using a template to extract from each search engine the relevant information of each result and the link to each result's detail page, and saving them; the template is a regular expression containing the search conditions;

the second crawl comprises: crawling the corresponding pages using the detail-page links obtained in the first crawl, and saving each page as a document, in order.

4. The evaluation method of claim 1, characterized in that the ranking by relevance comprises: dividing the documents evenly into several grades, and setting a corresponding relevance coefficient score for each grade.

5. The evaluation method of claim 1, characterized in that the DCG calculation is expressed by the following formula:

In the formula, s is the total number of documents, i is the ordinal position of the document, and rel_i is the relevance coefficient score of the grade to which that document belongs.

6. The evaluation method of claim 1, characterized in that the results calculated in step F are ranked and analyzed to produce several kinds of output, and a report is generated; the output includes: a ranking by average DCG score of the results calculated in step F, a ranking by total DCG score, and a ranking by the number of good and poor search results across all keywords.