CN103544307B - Automated comparative evaluation method for multiple search engines that does not depend on a document library - Google Patents

Info

Publication number
CN103544307B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310538069.5A
Other languages
Chinese (zh)
Other versions
CN103544307A (en)
Inventor
张鹏飞
赵毅强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongsou Cloud Business Network Technology Co ltd
Original Assignee
Beijing Zhongsou Cloud Business Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongsou Cloud Business Network Technology Co Ltd
Priority to CN201310538069.5A
Publication of CN103544307A
Application granted
Publication of CN103544307B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web

Abstract

The invention provides an automated method for comparatively evaluating multiple search engines without relying on a document library, characterised in that the method comprises the following steps: A. select the evaluation words; B. crawl the search results and save them as documents; C. extract the body text of each document; D. calculate relevance; E. pool the documents and rank them by relevance; F. calculate DCG; G. rank by DCG result and summarise the evaluation. The invention achieves the following effects: it is automated, needs no human involvement and saves a great deal of labour; it is fast, producing evaluation results in a short time; it is flexible, because many parts of the process are configurable and the relevance calculation can also be adjusted as needed; and it can be applied to many kinds of vertical search, not only plain web search but also news search, video search and so on.

Description

Automated comparative evaluation method for multiple search engines that does not depend on a document library
Technical field
The invention belongs to the field of search engines, and in particular relates to an automated comparative evaluation method for multiple search engines that does not depend on a document library.
Background
In today's network environment, search engines have become an indispensable tool for Internet users, and many search engines exist on the Internet. There are two main ways of comparing the results of different search engines. In the first, people select some keywords by hand, search for them on each engine, score every search result on the result pages, and then compare the scores to judge the relative quality of the engines. The second relies on a document library and evaluates each engine's algorithm by precision and recall.
Manual evaluation of search engine results consumes a great deal of human resources and time. A search engine that is being optimised has to be evaluated frequently, which places a heavy burden on manual evaluation and makes it impractical.
The document-library method can only be used for offline search engines. Because each search engine has a different document library, the method cannot evaluate search engines that are running online.
Summary of the invention
To overcome the above shortcomings of the prior art, the invention provides a method that can evaluate online search engines automatically and quickly. The method can compare the differences between the results of different search engines. It is suitable both for regular comparative evaluation between search engines and for the frequent evaluation needed while optimising a search engine, to check whether an optimised algorithm has succeeded.
To achieve this object, the invention adopts the following technical solution:
An automated comparative evaluation method for multiple search engines that does not depend on a document library, characterised in that the method comprises the following steps:
A. select the evaluation words;
B. crawl the search results and save them as documents;
C. extract the body text of each document;
D. calculate relevance;
E. pool the documents and rank them by relevance;
F. calculate DCG;
G. rank by DCG result and summarise the evaluation.
Preferably, the evaluation words include: page search keywords for web search, and film titles or actor names for video search.
Preferably, the crawl comprises two crawling passes.
The first pass comprises: generating the search-result link of each search engine from the keyword, performing the first crawl, using a template to extract from each search engine the relevant information of every result and the link to every result's detail page, and saving them. The template is a regular expression that includes the search conditions.
The second pass comprises: crawling the corresponding pages from the detail-page links obtained in the first pass, and saving each as a document, in order.
Preferably, the body text extraction methods include: HTML extraction based on the DOM tree, and longest-text-string body extraction.
The DOM-tree-based HTML extraction comprises: converting the HTML text into a DOM tree, then analysing the nodes of the DOM tree to extract the content related to the body text and remove irrelevant information from the page. The irrelevant information includes page noise and HTML tags.
The longest-text-string body extraction comprises: finding the longest text string in the HTML page content, extending it forwards and backwards until a threshold is reached, then truncating and extracting it to obtain the body text.
Preferably, the relevance calculation methods include a word-frequency ratio method, expressed as: relevance = (share of the word frequency within the document) × (share of the word frequency across all crawled results).
Preferably, ranking by relevance comprises: dividing the documents equally into several grades and setting a corresponding relevance score for each grade.
Preferably, DCG is calculated as follows:
$$DCG_s = rel_1 + \sum_{i=2}^{s} \frac{rel_i}{\log_2 i}$$
where s is the total number of documents, i is the rank position of a document, and $rel_i$ is the relevance score of the grade that document belongs to.
Preferably, the results calculated in step F are sorted and analysed to produce several outputs and generate a report. The outputs include: the ranking by average DCG score of the results calculated in step F, the ranking by total DCG score, and the ranking by the number of better or worse search results across all keywords.
Compared with the prior art, the beneficial effects of the invention are:
1) it is automated, needs no human involvement and saves a great deal of labour;
2) it is fast, producing evaluation results in a short time;
3) it is flexible: many parts of the process are configurable, and the relevance calculation can also be adjusted as needed;
4) the overall process can be applied to many kinds of vertical search, not only plain web search but also news search, video search and so on.
Brief description of the drawings
Fig. 1 is a flow chart of the evaluation process of the invention.
Detailed description
The invention is described in further detail below with reference to the drawing.
Analysis of the search engines and surveys of how users use them confirm that users mostly care about two aspects of a search engine: accuracy and ranking. Accuracy ensures that the content shown in the search results is what the user wants. Ranking places the results closest to the user's need first, so that the user can find the desired content directly without scrolling or turning pages. The invention therefore evaluates the results of each search engine mainly from these two aspects.
The specific steps are as follows:
1) Select the evaluation words
The quality of the evaluation words directly determines how well the evaluation result matches real-world performance. So that the evaluation covers as many searches as possible, the invention by default selects 3000 high-frequency search words as the evaluation sample; these words can be extracted from user search rankings. The range and number of words can be changed to suit the situation: for a web search evaluation, choose page search keywords; for video search, choose frequently searched film titles, actors and so on.
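The selection step can be sketched as a frequency count over a query log. This is a minimal illustration only: the function name `select_eval_words`, the log format (one query string per entry) and the sample data are assumptions, not part of the patent.

```python
from collections import Counter

def select_eval_words(query_log, limit=3000):
    """Return the `limit` most frequent queries, most frequent first.

    `query_log` is assumed to be an iterable of raw query strings.
    """
    counts = Counter(q.strip() for q in query_log if q.strip())
    return [word for word, _ in counts.most_common(limit)]

# Tiny hypothetical log; a real run would use the default limit of 3000.
sample_log = ["weather", "news", "weather", " ", "movie", "news", "weather"]
top_words = select_eval_words(sample_log, limit=2)
```

Keeping `limit` as a parameter mirrors the text's point that the number of words is configurable.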
2) Crawl the search results of each search engine
Research on user behaviour shows that most users look only at the first two pages of search results, roughly 40 items. By default the invention therefore crawls the first 40 items of the search results for analysis (the number of items can be configured as required). Most of what a search engine returns is a link to the source address plus a summary. Because this is not the complete result, the invention performs a second crawl, fetching the complete result page from the source address so that the relevance between the page and the search word can be calculated.
The two crawls work as follows. The search-result link of each search engine is first generated from the keyword and the first crawl is performed. Regular-expression templates are then used to extract, from each search engine, the relevant information of every result and the link to every result's detail page. These are saved, and the links are used for the second crawl.
The second crawl takes the detail-page links from the results of the first crawl, fetches the corresponding pages and saves them in order for use in step 3.
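The two passes can be sketched as follows. Everything named here is hypothetical: the engine name, the URL pattern, the regular expression and the sample HTML are invented for illustration, and the page fetcher is passed in as a function so the sketch runs without network access.

```python
import re
from urllib.parse import quote

# Hypothetical per-engine template: a URL pattern plus a regular expression
# with named groups for the detail link and the title of each result.
TEMPLATES = {
    "engine_a": {
        "url": "https://engine-a.example/search?q={query}&n={count}",
        "item": re.compile(r'<a class="result" href="(?P<link>[^"]+)">(?P<title>[^<]*)</a>'),
    },
}

def build_search_url(engine, keyword, count=40):
    """Generate the search-result link for one engine and keyword."""
    return TEMPLATES[engine]["url"].format(query=quote(keyword), count=count)

def first_pass(engine, result_page_html, count=40):
    """Extract (rank, title, link) for each result on a result page."""
    pattern = TEMPLATES[engine]["item"]
    matches = list(pattern.finditer(result_page_html))[:count]
    return [(rank, m.group("title"), m.group("link"))
            for rank, m in enumerate(matches, start=1)]

def second_pass(results, fetch):
    """Fetch every detail link in rank order; `fetch` maps a URL to page text."""
    return [(rank, link, fetch(link)) for rank, _title, link in results]

sample_page = (
    '<a class="result" href="http://site.example/a">First</a>'
    '<a class="result" href="http://site.example/b">Second</a>'
)
fake_pages = {"http://site.example/a": "<p>page a</p>",
              "http://site.example/b": "<p>page b</p>"}
results = first_pass("engine_a", sample_page)
documents = second_pass(results, fake_pages.get)
```

In a real crawler `fetch` would be an HTTP client call with timeouts and retries; injecting it keeps the extraction logic testable offline.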
3) Extract the body text
Result pages fetched from the source address mostly contain noise such as advertisements. Before calculating relevance, the body text of each result page is therefore extracted so that the noise does not affect the calculation.
The body text of a result page can be obtained with common methods such as DOM-tree-based HTML extraction or longest-text-string body extraction, and the relevance between that text and the search keyword is calculated from it.
The DOM-tree-based HTML extraction method first converts the HTML text into a DOM tree, then analyses the nodes of the tree to extract the content related to the body text and remove irrelevant information such as page noise and HTML tags. The key difficulty of this method is repairing the DOM tree correctly when it is incomplete.
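A simplified stand-in for this idea is sketched below using Python's standard `html.parser`. It does not build a full DOM tree or repair broken markup; it only tracks nesting depth and drops text inside containers that are assumed to be noise. The `SKIP` set and the sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect visible text, skipping containers assumed to hold noise."""

    SKIP = {"script", "style", "nav", "footer", "aside"}  # assumed noise tags

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped container
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)

def extract_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

page = ("<html><body><nav>Home | Login</nav><p>Main story text.</p>"
        "<script>var x=1;</script><footer>Ads</footer></body></html>")
body = extract_text(page)
```

A production version would more likely use a real DOM library that tolerates malformed HTML and score nodes by text density rather than by tag name alone.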
The longest-text-string method suits pages whose body is long text. It first finds the longest text string in the HTML content, then extends it forwards and backwards until a threshold is reached, and finally truncates and extracts it to obtain the body text.
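One possible reading of the longest-text-string method is sketched below. The block splitting (text runs between tags), the character-count threshold and its default of 200 are assumptions; the patent does not give these details.

```python
import re

TAG = re.compile(r"<[^>]+>")

def longest_string_body(html, threshold=200):
    """Start from the longest text run and grow outwards up to `threshold` chars."""
    # Split on tags, keeping non-empty text runs in page order.
    blocks = [b.strip() for b in TAG.split(html) if b.strip()]
    if not blocks:
        return ""
    centre = max(range(len(blocks)), key=lambda i: len(blocks[i]))
    lo = hi = centre
    length = len(blocks[centre])
    # Extend one block backwards and one forwards per round until the
    # threshold is reached or the page has no more blocks.
    while length < threshold and (lo > 0 or hi < len(blocks) - 1):
        if lo > 0:
            lo -= 1
            length += len(blocks[lo])
        if length < threshold and hi < len(blocks) - 1:
            hi += 1
            length += len(blocks[hi])
    # Truncate to the threshold.
    return " ".join(blocks[lo:hi + 1])[:threshold]

sample = "<div>Menu</div><p>" + "A" * 50 + "</p><p>short tail</p>"
excerpt = longest_string_body(sample, threshold=60)
```

This sketch does not strip script or style contents first, so in practice it would be run after a cleaning pass.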
4) Calculate relevance
The relevance calculation is the key link in the process. Steps 2 and 3 both exist to prepare for it. Only if the relevance between each search result and the search keyword is calculated correctly can the final evaluation result be correct.
The relevance rules can also be changed for different vertical searches. Web search puts more weight on how well the content matches; news search has to consider both content match and time; video search pays more attention to titles, annotations and the like.
In the invention the relevance algorithm can be adjusted flexibly. A small number of manually evaluated results can be used as samples, with machine learning adjusting the weights needed for the relevance calculation dynamically. Alternatively, an established relevance algorithm can be used directly.
For example, in the news search test a word-frequency ratio method was used to calculate the relevance of plain text. The calculation is: relevance = (share of the word frequency within the document) × (share of the word frequency across all crawled results), that is:
[formula not reproduced]
where
[formula not reproduced]
The cube root is taken in order to balance the weight against P(D);
in the formula, n is the number of words after word segmentation, N(i) is the number of occurrences of word i, L(i) is the length of word i, and L(T) is the length of the text;
[formula not reproduced]
in the formula, T(i) is the number of occurrences of word i in all search results of all search engines;
the time relevance P(T) uses a reciprocal curve:
[formula not reproduced]
in the formula, T(n) is the current time, T(t) is the publication time, and the numerator W is a weight used to balance P(M) against P(T);
the final relevance is the harmonic mean of the two:
$$\frac{2\,P(M)\,P(T)}{P(M) + P(T)}$$
This gives more weight to whichever of the two relevance values is lower, so that the result is closer to the real situation.
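Because the formulas themselves are not available in this text, the following is only a sketch under stated assumptions, not the patented calculation. It assumes the document share is the cube root of the fraction of the text covered by the query words, the cross-result share is the mean of N(i)/T(i), and the time score is W divided by the age of the page. The function names and sample numbers are hypothetical; only the harmonic-mean combination of P(M) and P(T) follows directly from the description.

```python
def doc_share(counts, lengths, text_length):
    """Assumed form: cube root of sum(N(i) * L(i)) / L(T)."""
    covered = sum(counts[w] * lengths[w] for w in counts)
    return (covered / text_length) ** (1 / 3) if text_length else 0.0

def corpus_share(counts, totals):
    """Assumed form: mean over query words of N(i) / T(i)."""
    ratios = [counts[w] / totals[w] for w in counts if totals.get(w)]
    return sum(ratios) / len(ratios) if ratios else 0.0

def time_score(now, published, weight=1.0):
    """Assumed reciprocal curve W / (T(n) - T(t)); ages below one unit count as one."""
    return weight / max(now - published, 1.0)

def harmonic_mean(a, b):
    """Harmonic mean of two scores; zero when both are zero."""
    return 2 * a * b / (a + b) if a + b else 0.0

# Hypothetical numbers for one news page and a two-word query.
counts = {"flood": 2, "relief": 1}    # N(i): occurrences in this page
lengths = {"flood": 5, "relief": 6}   # L(i): word lengths in characters
totals = {"flood": 8, "relief": 4}    # T(i): occurrences in all crawled results
p_m = doc_share(counts, lengths, text_length=128) * corpus_share(counts, totals)
p_t = time_score(now=10.0, published=6.0, weight=1.0)
relevance = harmonic_mean(p_m, p_t)
```

The harmonic mean is dominated by the smaller input: `harmonic_mean(0.1, 0.9)` is 0.18, well below the arithmetic mean of 0.5, which is the behaviour the text describes.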
5) Pool the documents and rank them by relevance
Step 4 calculates a relevance value for every result document. In this step, all result documents returned by all search engines for a single search keyword are pooled and sorted by relevance. The sorted results are then divided equally into three classes: good, medium and poor (more classes can be used if required; the division is automated). The relevance scores of the classes are set to 3 down to 1 (with N classes, the scores run from N down to 1). These scores are passed to the DCG formula so that the final DCG score can be calculated.
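The pooling and grading can be sketched as below. The function name, the `(doc_id, relevance)` tuple format and the sample data are assumptions; the equal split and the N-down-to-1 scores follow the description.

```python
def grade_documents(scored_docs, classes=3):
    """Map each pooled document to a grade score from `classes` down to 1.

    `scored_docs` is a list of (doc_id, relevance) pairs pooled from all
    engines for one keyword.
    """
    ranked = sorted(scored_docs, key=lambda d: d[1], reverse=True)
    total = len(ranked)
    grades = {}
    for position, (doc_id, _relevance) in enumerate(ranked):
        bucket = position * classes // total  # 0 is the best class
        grades[doc_id] = classes - bucket
    return grades

# Six pooled results from two hypothetical engines "a" and "b".
pooled = [("a1", 0.9), ("b1", 0.8), ("a2", 0.5),
          ("b2", 0.4), ("a3", 0.2), ("b3", 0.1)]
grades = grade_documents(pooled)
```

When the number of documents is not a multiple of the class count, this integer split makes the classes differ in size by at most one document.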
6) Calculate DCG
DCG is a measure of ranking quality. If highly relevant documents are placed at the top of the result page the score is high; if low-relevance documents are placed at the top the score is low. The DCG formula for s documents is:
$$DCG_s = rel_1 + \sum_{i=2}^{s} \frac{rel_i}{\log_2 i}$$
Step 5 sorted the search results of a single keyword and assigned every document a relevance score, which is the $rel_i$ in the formula. All results for the keyword are then grouped by search engine. Within each engine's group, the formula is applied using the rank i of each result within that engine, which gives the keyword's total DCG score for that engine. Repeating this for every group gives the keyword's DCG score in each search engine.
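The per-engine calculation can be sketched as follows, assuming the classic DCG form in which rank 1 is undiscounted and rank i ≥ 2 is discounted by 1/log2 i (the discount term named in the text). The data structures and engine names are hypothetical.

```python
import math

def dcg(rels):
    """DCG of grade scores in rank order; rels[0] is the score at rank 1."""
    total = 0.0
    for i, rel in enumerate(rels, start=1):
        # Rank 1 is taken undiscounted; later ranks are divided by log2(i).
        total += rel if i == 1 else rel / math.log2(i)
    return total

def dcg_per_engine(results, grades):
    """results: {engine: [doc_id in rank order]}; grades: {doc_id: score}."""
    return {engine: dcg([grades[d] for d in docs])
            for engine, docs in results.items()}

grades = {"a1": 3, "a2": 2, "a3": 1, "b1": 3, "b2": 2, "b3": 1}
results = {
    "engine_a": ["a1", "a2", "a3"],  # best documents first
    "engine_b": ["b3", "b2", "b1"],  # same grades, worst first
}
scores = dcg_per_engine(results, grades)
```

Both hypothetical engines return documents with the same grades, yet `engine_a` scores higher because its best document is not discounted.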
Several situations arise in the DCG calculation:
1. The results of search engine A are generally better than those of search engine B, but not as well ordered as B's. Because $rel_A$ is generally higher than $rel_B$, A's DCG comes out higher than B's, which is the expected outcome.
2. The results of search engines A and B have similar relevance, but A's are better ordered. B's high-scoring $rel_B$ values sit at lower ranks and are pulled down by the rank discount $1/\log_2 i$, so B's overall DCG is lower than A's, which is the expected outcome.
3. The results of search engine A are better than those of search engine B and also better ordered. A's DCG is then certainly higher than B's, which is the expected outcome.
All three situations were verified while implementing the invention, so the DCG result can serve as a standard for judging the quality of search engine results.
7) Rank by DCG result and summarise the evaluation
The results obtained in step 6 are sorted and analysed in detail to produce several outputs, such as the ranking by average DCG score over all results, the ranking by total DCG score, and the ranking by the number of better or worse search results across all keywords. A report is generated so that the engines can be compared at a glance.
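A sketch of the summary step is shown below. It interprets the third output as the number of keywords on which each engine achieved the highest DCG, which is an assumption about the intended metric. It also assumes every engine has a score for every keyword. The names and sample scores are hypothetical.

```python
def summarise(dcg_by_keyword):
    """dcg_by_keyword: {keyword: {engine: DCG score}} -> rankings and tables."""
    totals, wins = {}, {}
    for per_engine in dcg_by_keyword.values():
        for engine, score in per_engine.items():
            totals[engine] = totals.get(engine, 0.0) + score
            wins.setdefault(engine, 0)
        # Count a win for the engine with the highest DCG on this keyword.
        wins[max(per_engine, key=per_engine.get)] += 1
    keyword_count = len(dcg_by_keyword)
    averages = {e: t / keyword_count for e, t in totals.items()}

    def rank(table):
        return sorted(table, key=table.get, reverse=True)

    return {
        "average_ranking": rank(averages),
        "total_ranking": rank(totals),
        "wins_ranking": rank(wins),
        "averages": averages,
        "wins": wins,
    }

sample_scores = {
    "kw1": {"A": 5.0, "B": 4.0},
    "kw2": {"A": 3.0, "B": 3.5},
    "kw3": {"A": 6.0, "B": 2.0},
}
report = summarise(sample_scores)
```

The returned dictionary can be written out as a table or chart, which corresponds to the report mentioned in the text.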
With the method of the invention, evaluation results can be obtained quickly and conveniently, avoiding the large cost in time and labour of manual evaluation. The method was tested on news search, a vertical search: 3000 trending news words were chosen and searched on four search engines, Baidu, Sogou, Zhongsou and Yahoo (Google was not included as an evaluation target because it is frequently blocked), taking 40 search results from each engine. The evaluation took about 2 hours, with web crawling as the bottleneck. Comparing these results with the results of manual evaluation showed that the invention's evaluation differed from the manual evaluation by less than 5%.
Finally, it should be noted that the above embodiments only illustrate the technical solution of the invention and do not limit it. Although the invention has been described in detail with reference to the above embodiments, a person of ordinary skill in the art should understand that the specific embodiments can still be modified or replaced with equivalents, and that any modification or equivalent replacement that does not depart from the spirit and scope of the invention falls within the scope of its claims.

Claims (6)

1. An automated comparative evaluation method for multiple search engines that does not depend on a document library, characterised in that the method comprises the following steps:
A. select the evaluation words;
B. crawl the search results and save them as documents;
C. extract the body text of each document;
D. calculate relevance;
E. pool the documents and rank them by relevance;
F. calculate DCG;
G. rank by DCG result and summarise the evaluation;
the evaluation words are 3000 high-frequency words selected from search engine results;
the relevance calculation methods include a word-frequency ratio method, expressed as: relevance = (share of the word frequency within the document) × (share of the word frequency across all crawled results);
the body text extraction methods include: HTML extraction based on the DOM tree, and longest-text-string body extraction;
the DOM-tree-based HTML extraction comprises: converting the HTML text into a DOM tree, then analysing the nodes of the DOM tree to extract the content related to the body text and remove irrelevant information from the page; the irrelevant information includes page noise and HTML tags;
the longest-text-string body extraction comprises: finding the longest text string in the HTML page content, extending it forwards and backwards until a threshold is reached, then truncating and extracting it to obtain the body text.
2. The evaluation method of claim 1, characterised in that the evaluation words include: page search keywords for web search, and film titles or actor names for video search.
3. The evaluation method of claim 1, characterised in that the crawl comprises two crawling passes;
the first pass comprises: generating the search-result link of each search engine from the keyword, performing the first crawl, using a template to extract from each search engine the relevant information of every result and the link to every result's detail page, and saving them; the template is a regular expression that includes the search conditions;
the second pass comprises: crawling the corresponding pages from the detail-page links obtained in the first pass, and saving each as a document, in order.
4. The evaluation method of claim 1, characterised in that ranking by relevance comprises: dividing the documents equally into several grades and setting a corresponding relevance score for each grade.
5. The evaluation method of claim 1, characterised in that DCG is calculated as follows:
$$DCG_s = rel_1 + \sum_{i=2}^{s} \frac{rel_i}{\log_2 i}$$
where s is the total number of documents, i is the rank position of a document, and $rel_i$ is the relevance score of the grade that document belongs to.
6. The evaluation method of claim 1, characterised in that the results calculated in step F are sorted and analysed to produce several outputs and generate a report; the outputs include: the ranking by average DCG score of the results calculated in step F, the ranking by total DCG score, and the ranking by the number of better or worse search results across all keywords.
CN201310538069.5A 2013-11-04 2013-11-04 Automated comparative evaluation method for multiple search engines that does not depend on a document library Expired - Fee Related CN103544307B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310538069.5A CN103544307B (en) 2013-11-04 2013-11-04 A kind of multiple search engine automation contrast evaluating method independent of document library

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310538069.5A CN103544307B (en) 2013-11-04 2013-11-04 A kind of multiple search engine automation contrast evaluating method independent of document library

Publications (2)

Publication Number Publication Date
CN103544307A CN103544307A (en) 2014-01-29
CN103544307B true CN103544307B (en) 2017-08-08

Family

ID=49967759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310538069.5A Expired - Fee Related CN103544307B (en) 2013-11-04 2013-11-04 A kind of multiple search engine automation contrast evaluating method independent of document library

Country Status (1)

Country Link
CN (1) CN103544307B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808601B (en) * 2014-12-31 2019-07-23 北京奇虎科技有限公司 Assessment search engine resource includes the calculation method and device of loss
CN104699825B (en) * 2015-03-30 2016-10-05 北京奇虎科技有限公司 The balancing method of Performance of Search Engine and device
CN104699830B (en) * 2015-03-30 2017-04-12 北京奇虎科技有限公司 Method and device for evaluating search engine ordering algorithm effectiveness
CN106227762B (en) * 2016-07-15 2019-06-28 苏群 A kind of method for vertical search and system based on user's assistance
CN107704467B (en) * 2016-08-09 2021-08-24 百度在线网络技术(北京)有限公司 Search quality evaluation method and device
CN106776299A (en) * 2016-11-30 2017-05-31 努比亚技术有限公司 Search engine test device and method
WO2018187949A1 (en) * 2017-04-12 2018-10-18 邹霞 Perspective analysis method for machine learning model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079033A (en) * 2006-06-30 2007-11-28 腾讯科技(深圳)有限公司 Integrative searching result sequencing system and method
CN101089856A (en) * 2007-07-20 2007-12-19 李沫南 Method for abstracting network data and web reptile system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7720870B2 (en) * 2007-12-18 2010-05-18 Yahoo! Inc. Method and system for quantifying the quality of search results based on cohesion

Also Published As

Publication number Publication date
CN103544307A (en) 2014-01-29

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20170427

Address after: 100086 Beijing, Haidian District, North Third Ring Road West, No. 43, building 5, floor 08-09, No. 2

Applicant after: BEIJING ZHONGSOU CLOUD BUSINESS NETWORK TECHNOLOGY Co.,Ltd.

Address before: Shou Heng Technology Building No. 51 Beijing 100191 Haidian District Xueyuan Road room 0902

Applicant before: BEIJING ZHONGSOU NETWORK TECHNOLOGY Co.,Ltd.

GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170808

Termination date: 20211104