Summary of the Invention
To overcome the above deficiencies of the prior art, the present invention provides a method that can automatically and quickly evaluate online search engines. The method can compare the differences between the results of different search engines. It is suited both to periodic comparative evaluation across search engines and to repeated evaluation while a search engine's algorithm is being optimized, to check whether the optimization succeeded.
To achieve the above object, the present invention adopts the following technical solution:
An automated comparative evaluation method for multiple search engines that does not depend on a document library, characterized in that the method comprises the following steps:
A. select the evaluation words;
B. crawl the search results and save them as documents;
C. extract the body text of the documents;
D. calculate relevance;
E. merge the documents and sort them by relevance;
F. calculate the DCG;
G. sort by the DCG results and summarize the evaluation results.
Preferably, the evaluation words include: web search keywords for web search, and film titles or actor names for video search.
Preferably, the crawling comprises two crawling passes;
The first crawl comprises: generating the search-result link of each search engine from the keyword, performing the first crawl, using a template to extract from each search engine the relevant information of each result and the link to each result's detail page, and saving them;
The template is a regular expression that includes the search conditions;
The second crawl comprises: fetching the corresponding pages from the detail-page links obtained in the first crawl, and saving each page as a document, in order.
Preferably, the body-text extraction methods include: a DOM-tree-based HTML extraction method and a longest-text-string extraction method;
The DOM-tree-based HTML extraction method comprises: converting the HTML text into a DOM tree, then analyzing the nodes of the DOM tree to extract the content related to the body text and remove irrelevant information from the page; the irrelevant information includes page noise and HTML tags;
The longest-text-string extraction method comprises: finding the longest text string in the HTML page content, extending it forwards and backwards until a threshold is reached, then truncating and extracting to obtain the body text.
Preferably, the relevance calculation methods include a term-frequency ratio method, expressed as: relevance = proportion of the term frequency within the document × proportion of the term frequency within all crawled results.
Preferably, the sorting by relevance comprises: dividing the documents evenly into several grades and assigning each grade a corresponding relevance score.
Preferably, the DCG is calculated by the following formula:
$$\mathrm{DCG}_s = rel_1 + \sum_{i=2}^{s} \frac{rel_i}{\log_2 i}$$
In the formula, s is the total number of documents, i is the ordinal rank of the document, and rel_i is the relevance score of the grade to which the document belongs.
Preferably, the results calculated in step F are sorted and analyzed to produce several kinds of output and to generate a report; the output includes: the ranking by average DCG score of the results calculated in step F, the ranking by total DCG score, and the ranking by number of high-quality search results across all keywords.
Compared with the prior art, the beneficial effects of the present invention are:
1) Automated: no human participation is required, which saves a great deal of labor;
2) Fast: evaluation results are obtained in a short time;
3) Flexible: many parts of the process are configurable, and the relevance calculation and other components can be adjusted as needed, so the method is highly flexible;
4) Broadly applicable: the overall process can be applied to many kinds of vertical search, not only plain web search but also news search, video search, and so on.
Detailed Description of the Embodiments
The present invention is described in further detail below in conjunction with the accompanying drawings.
Analysis of the major search engines and surveys of how users use them confirm that users' concerns about a search engine fall mostly on two aspects: accuracy and ranking. Accuracy ensures that the content shown in the search results is what the user wants. Ranking places the results closest to the user's need first, so that the user can find the desired content directly without scrolling down or paging. The present invention therefore takes these two aspects as the starting point for evaluating the results of each search engine.
The specific steps are as follows:
1) Select the evaluation words
The quality of the evaluation-word selection directly determines how well the evaluation result matches real-world performance. So that the evaluation covers as many searches as possible, the present invention by default chooses 3000 high-frequency search words as the evaluation sample; these words can be extracted from users' search rankings. The scope and number of words can be changed according to the actual situation: for a web search evaluation, choose web search keywords; for video search, choose frequently searched film titles, actors, and so on.
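The word selection can be sketched as a simple frequency count. This is a minimal illustration, assuming the search rankings are available as a plain list of query strings; the name `top_queries`, the `query_log` input, and the lower-casing step are hypothetical and not part of the described method.

```python
from collections import Counter


def top_queries(query_log, limit=3000):
    """Return the `limit` most frequent queries, most frequent first."""
    # Normalizing case and whitespace is an assumption made for this sketch.
    counts = Counter(q.strip().lower() for q in query_log if q.strip())
    return [query for query, _ in counts.most_common(limit)]


log = ["weather", "Weather ", "news", "weather", "movie times", "news"]
sample = top_queries(log, limit=2)
print(sample)
```

The `limit` parameter corresponds to the configurable sample size; the default of 3000 mirrors the default stated in the text.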
2) Crawl the search results of each search engine
Studies of user behavior show that most users look only at the first 2 pages of search results, that is, roughly 40 results. By default, therefore, the present invention crawls the first 40 entries of the search results for analysis (the number of entries can be configured as needed). The results returned by a search engine mostly consist of a link to the source address and a summary. Because these are not complete results, the present invention performs a second crawl, fetching the complete result page from the source address, in order to calculate the relevance between the page and the search word.
The two-pass crawl proceeds as follows. First, the search-result link of each search engine is generated from the keyword and the first crawl is performed. A regular-expression template is used to extract, from each search engine, the relevant information of each result and the link to each result's detail page; these are saved for use by the second crawl.
The second crawl takes the detail-page links from the results of the first crawl, fetches the corresponding pages, and saves them in order for use in step 3.
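The two passes can be sketched as follows. This is only an illustration under stated assumptions: the engine name `engine_a`, the HTML shape matched by the template, and the `fetch` callable are all hypothetical, and no real search engine's markup is implied. Network access is replaced by a stub so the sketch runs offline.

```python
import re

# Hypothetical per-engine template: a regular expression with named groups.
TEMPLATES = {
    "engine_a": re.compile(
        r'<a class="result" href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>'
    ),
}


def first_pass(engine, listing_html, limit=40):
    """Extract rank, title and detail-page URL from one result listing."""
    records = []
    for rank, match in enumerate(TEMPLATES[engine].finditer(listing_html), start=1):
        if rank > limit:
            break
        records.append({"rank": rank, "title": match["title"], "url": match["url"]})
    return records


def second_pass(records, fetch):
    """Fetch each detail page in rank order; `fetch` maps a URL to HTML text."""
    return [dict(record, html=fetch(record["url"])) for record in records]


listing = (
    '<a class="result" href="http://example.com/1">First</a>'
    '<a class="result" href="http://example.com/2">Second</a>'
)
records = first_pass("engine_a", listing)
pages = second_pass(records, fetch=lambda url: "<html>stub for %s</html>" % url)
```

Keeping one template per engine means that only the regular expression changes when an engine changes its result markup; the `limit` of 40 mirrors the default number of results described above.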
3) Extract the body text
The result pages fetched from the source addresses mostly contain noise such as advertisements, so before relevance is calculated the body content of each result page is extracted, to keep this noise from affecting the calculation.
For body-text extraction, common methods such as the DOM-tree-based HTML extraction method or the longest-text-string extraction method can be used to obtain the body text of a result page, from which the relevance between that text and the search keyword is calculated.
The DOM-tree-based HTML extraction method first converts the HTML text into a DOM tree, then analyzes the nodes of the DOM tree to extract the content related to the body text and remove irrelevant information such as page noise and HTML tags. The key difficulty of this method is how to repair the DOM tree correctly when it is incomplete.
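A rough sketch of tag-aware extraction is given below. It is not a full DOM implementation: it uses the event-based `HTMLParser` from Python's standard library instead of building a tree, and the set of tags treated as noise (`SKIPPED`) is an assumption chosen for illustration.

```python
from html.parser import HTMLParser


class BodyTextParser(HTMLParser):
    """Collect text outside of tags that usually hold page noise."""

    SKIPPED = {"script", "style", "nav", "footer", "aside"}  # assumed noise tags

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    parser = BodyTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><body><nav>Home | About</nav>"
    "<p>Main <b>story</b> text.</p>"
    "<script>var x = 1;</script></body></html>"
)
body = extract_text(page)
```

Note that an unclosed noise tag would leave `skip_depth` above zero and suppress all later text, which is one concrete form of the incomplete-tree problem mentioned above; a production extractor would need a repair step.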
The longest-text-string extraction method is suited to pages whose body is long text. It first finds the longest text string in the HTML content, then extends it forwards and backwards until a threshold is reached, then truncates and extracts to obtain the body text.
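One possible reading of this procedure is sketched below. The text does not say what the threshold measures or how the extension alternates, so this sketch assumes the threshold is a character count and that neighbouring text runs are added one before and one after in turn; the function name `longest_string_body` is hypothetical.

```python
import re


def longest_string_body(html, threshold=200):
    """Seed with the longest text run, then grow it up to `threshold` characters."""
    # Drop script/style blocks, then split what remains on tags.
    cleaned = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    segments = [s.strip() for s in re.split(r"<[^>]+>", cleaned)]
    segments = [s for s in segments if s]
    if not segments:
        return ""
    # Index of the longest text run between tags.
    start = end = max(range(len(segments)), key=lambda i: len(segments[i]))
    total = len(segments[start])
    # Extend backwards and forwards in turn until the threshold is reached.
    while total < threshold and (start > 0 or end < len(segments) - 1):
        if start > 0:
            start -= 1
            total += len(segments[start])
        if total < threshold and end < len(segments) - 1:
            end += 1
            total += len(segments[end])
    # Truncate to the threshold, as the description requires.
    return " ".join(segments[start:end + 1])[:threshold]


sample = "<div>Ad</div><p>" + "x" * 50 + "</p><p>tail</p>"
```

With a threshold equal to the length of the longest run, no neighbouring text is pulled in; larger thresholds admit more surrounding text, which is why the method suits pages dominated by one long body.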
4) Calculate relevance
The relevance calculation is the key link in the process of the present invention; steps 2 and 3 both exist to prepare for it. Only when the relevance between each search result and the search keyword is calculated correctly can the correctness of the final evaluation result be guaranteed.
The choice of relevance rule can also vary with the vertical search being evaluated: web search places more weight on content matching; news search must consider both content matching and time; video search pays more attention to titles, annotations, and so on.
In the present invention the relevance algorithm can be adjusted flexibly. A small set of manually evaluated results can be used as a sample to tune the weights of the relevance calculation dynamically by machine learning, or an existing, established relevance algorithm can be used directly.
For example, in the news search test, the term-frequency ratio method was used to calculate the relevance of plain text. Specifically, relevance = proportion of the term frequency within the document × proportion of the term frequency within all crawled results, that is:
where
the cube root is taken to balance the weight against P(D);
in the formula, n is the number of words after word segmentation, N(i) is the number of times word i occurs, L(i) is the length of word i, and L(T) is the total length of the text;
in the formula, T(i) is the number of times word i occurs in all search results of all search engines.
The time relevance uses a reciprocal curve:
in the formula, T(n) is the current time, T(t) is the publication time, and the numerator W is a weight used to balance P(M) against P(T).
The final relevance is calculated as the harmonic mean of the two,
which gives more weight to whichever relevance is lower and brings the result closer to actual conditions.
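The combination step can be illustrated numerically. The exact formulas are not shown in the text above, so this sketch assumes a reciprocal decay of the form W / (1 + age) for the time relevance (the added 1 is only there to avoid division by zero) and treats the content relevance as a given number; `time_relevance` and `harmonic_mean` are hypothetical names.

```python
def time_relevance(now, published, weight=1.0):
    """Reciprocal decay with age; the exact curve is an assumption of this sketch."""
    age = max(now - published, 0.0)
    return weight / (1.0 + age)


def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores; zero if either is zero."""
    if a <= 0 or b <= 0:
        return 0.0
    return 2.0 * a * b / (a + b)


content = 0.8                                       # assumed content relevance
fresh = time_relevance(now=10.0, published=10.0)    # age 0
stale = time_relevance(now=10.0, published=1.0)     # age 9

print(harmonic_mean(content, fresh))
print(harmonic_mean(content, stale))
```

For the stale page the harmonic mean lies well below the arithmetic mean of the two scores, which shows how the lower of the two relevances dominates the combined value.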
5) Merge and sort by relevance
Step 4 calculates a relevance for each result document. Here, all result documents returned by all search engines for a single search keyword are merged and sorted by relevance, and the results are then divided evenly into three classes: good, medium, and poor (more classes can be used as needed; this is an automated operation). The classes are given relevance scores from 3 down to 1 (with N classes, the scores run from N down to 1), which are supplied to the DCG formula for calculating the final DCG score.
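The merge-and-grade step can be sketched as below, assuming each result is a dictionary with `engine`, `rank`, and `relevance` fields (a hypothetical layout). When the number of documents is not divisible by the number of classes, this sketch lets the last class be smaller, which is one of several reasonable choices.

```python
def grade_documents(docs, classes=3):
    """Sort merged documents by relevance and split them into equal classes.

    Each returned document gains a `rel` score: `classes` for the best class
    down to 1 for the worst.
    """
    ranked = sorted(docs, key=lambda d: d["relevance"], reverse=True)
    size = -(-len(ranked) // classes)  # ceiling division: documents per class
    graded = []
    for position, doc in enumerate(ranked):
        graded.append(dict(doc, rel=classes - position // size))
    return graded


docs = [
    {"engine": "a", "rank": 1, "relevance": 0.9},
    {"engine": "a", "rank": 2, "relevance": 0.2},
    {"engine": "a", "rank": 3, "relevance": 0.4},
    {"engine": "b", "rank": 1, "relevance": 0.5},
    {"engine": "b", "rank": 2, "relevance": 0.7},
    {"engine": "b", "rank": 3, "relevance": 0.1},
]
graded = grade_documents(docs)
```

Each document keeps its original `engine` and `rank`, so the graded list can be regrouped by search engine afterwards.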
6) Calculate the DCG
DCG is a measure of ranking quality: when highly relevant documents appear near the top of the result page the score is high; conversely, when documents of low relevance appear near the top, the score is low. The DCG of s documents is calculated as:
$$\mathrm{DCG}_s = rel_1 + \sum_{i=2}^{s} \frac{rel_i}{\log_2 i}$$
Step 5 has sorted the search results for a single search keyword and assigned each document a relevance score, which is rel_i in the formula. All results for the keyword are then grouped by search engine. Within each search engine's group, the formula is applied using the rank i of each result within that search engine to obtain the keyword's total DCG score in that engine. Calculating every group in this way gives the keyword's DCG score in each search engine.
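The per-engine calculation can be sketched as follows. The sketch assumes the common form of DCG in which the first position is undiscounted and position i (for i ≥ 2) is discounted by 1/log2(i), consistent with the 1/log2 i factor mentioned in this description; the record layout (`engine`, `rank`, `rel`) is hypothetical.

```python
import math
from collections import defaultdict


def dcg(rels):
    """DCG of relevance scores listed in rank order (rank 1 first)."""
    return sum(
        rel if i == 1 else rel / math.log2(i)
        for i, rel in enumerate(rels, start=1)
    )


def dcg_by_engine(graded):
    """Group graded results by engine and score each group by its own ranking."""
    groups = defaultdict(list)
    for doc in graded:
        groups[doc["engine"]].append(doc)
    return {
        engine: dcg([d["rel"] for d in sorted(docs, key=lambda d: d["rank"])])
        for engine, docs in groups.items()
    }


graded = [
    {"engine": "a", "rank": 1, "rel": 3},
    {"engine": "a", "rank": 2, "rel": 2},
    {"engine": "a", "rank": 3, "rel": 1},
    {"engine": "b", "rank": 1, "rel": 1},
    {"engine": "b", "rank": 2, "rel": 2},
    {"engine": "b", "rank": 3, "rel": 3},
]
scores = dcg_by_engine(graded)
```

In this example both engines return results with the same relevance scores, but engine `a` ranks the best one first and engine `b` ranks it last, so `a` receives the higher DCG.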
The following situations arise in the DCG calculation:
1. Search engine A's results are generally better than search engine B's, but its ranking is worse than B's. Because rel_A is generally higher than rel_B, A's DCG is higher than B's, as expected.
2. The relevance of search engine A's results is about the same as that of B's, but A's ranking is better. B's high-scoring results, ranked lower, are then pulled down by the ranking discount 1/log2 i, so B's overall DCG is lower than A's, as expected.
3. Search engine A's results are better than search engine B's and its ranking is also better. A's DCG is then certainly higher than B's, as expected.
All three situations were verified during implementation of the present invention, so the DCG result can serve as a standard for evaluating the quality of search engine results.
7) Sort by the DCG results and summarize the evaluation results
Sorting and analyzing the results of step 6 in detail yields several kinds of output, such as the ranking by average DCG score over all results, the ranking by total DCG score, and the ranking by number of high-quality search results across all keywords. A report is generated so that the comparison can be viewed at a glance.
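A minimal sketch of the summary step follows. It assumes the DCG scores are held as a mapping from keyword to per-engine score and that every engine has a score for every keyword. The `wins` column, which counts the keywords on which an engine has the highest DCG, is only one possible interpretation of the quality-count ranking and is an assumption of this sketch.

```python
def summarize(scores):
    """scores: {keyword: {engine: dcg}} -> report rows sorted by average DCG."""
    totals, wins = {}, {}
    for per_engine in scores.values():
        best = max(per_engine, key=per_engine.get)
        wins[best] = wins.get(best, 0) + 1
        for engine, value in per_engine.items():
            totals[engine] = totals.get(engine, 0.0) + value
    rows = [
        {
            "engine": engine,
            "total": total,
            "average": total / len(scores),
            "wins": wins.get(engine, 0),
        }
        for engine, total in totals.items()
    ]
    return sorted(rows, key=lambda row: row["average"], reverse=True)


scores = {
    "k1": {"a": 5.0, "b": 4.0},
    "k2": {"a": 3.0, "b": 3.5},
    "k3": {"a": 4.0, "b": 2.0},
}
report = summarize(scores)
for row in report:
    print(row)
```

The rows could be written to a spreadsheet or table to form the report; the output format is left open here because the description does not specify one.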
The method of the present invention obtains evaluation results quickly and easily, completely avoiding the large time and labor cost of manual evaluation. In a test on news search, one kind of vertical search, 3000 trending news words were chosen and four search engines were evaluated: Baidu, Sogou, Zhongsou, and Yahoo (Google was not included as an evaluation target because of frequent blocking and similar problems). With 40 search results taken from each search engine, the evaluation took about 2 hours (the bottleneck being web page crawling). Comparison with the results of manual evaluation showed that the evaluation results of the present invention differ from the manual results by no more than 5%.
Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific embodiments of the present invention may still be modified or equivalently substituted, and that any modification or equivalent substitution that does not depart from the spirit and scope of the present invention shall fall within the scope of the claims of the present invention.