CN103544307A - Multi-search-engine automatic comparison and evaluation method independent of document library - Google Patents

Multi-search-engine automatic comparison and evaluation method independent of document library

Info

Publication number
CN103544307A
Authority
CN
China
Prior art keywords
search
results
document
text
dcg
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310538069.5A
Other languages
Chinese (zh)
Other versions
CN103544307B (en)
Inventor
张鹏飞
赵毅强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongsou Cloud Business Network Technology Co ltd
Original Assignee
Beijing Zhongsou Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongsou Network Technology Co ltd
Priority to CN201310538069.5A
Publication of CN103544307A
Application granted
Publication of CN103544307B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a multi-search-engine automatic comparison and evaluation method that is independent of a document library. The method comprises the following steps: A, selecting evaluation words; B, crawling search results and storing them as documents; C, extracting the body text of the documents; D, calculating relevance; E, merging the documents and sorting them by relevance; F, calculating DCG (discounted cumulative gain); G, ranking by the DCG results and summarizing the evaluation results. The method has the following advantages: it is automated and requires no human participation, saving a large amount of manpower; it is fast and yields evaluation results in a short time; it is highly flexible, because many stages are configurable and the relevance calculation, among others, can be adjusted; and it can be applied to various vertical searches, that is, not only to web search but also to news search, video search and the like.

Description

A multi-search-engine automatic comparison and evaluation method independent of a document library
Technical field
The invention belongs to the field of search engines, and specifically relates to a multi-search-engine automatic comparison and evaluation method that does not rely on a document library.
Background art
In today's network environment, the search engine has become an indispensable tool for Internet users, and many search engines exist on the Internet. Two methods are mainly used to compare the results of different search engines. In the first, keywords are selected manually and searched on each engine, every result on the returned pages is scored, and the scores are compared to judge the relative quality of the engines. The second relies on a document library and evaluates each search engine's algorithm by precision and recall.
Manual evaluation of search engine results consumes a large amount of manpower and time. If a search engine is being optimized and must be evaluated frequently, this places a huge burden on manual evaluation and makes it impractical.
The document-library method can only be applied to offline search engines; because each search engine has a different document library, it cannot evaluate search engines that are running online.
Summary of the invention
To overcome the above deficiencies of the prior art, the invention provides a method that can quickly and automatically evaluate online search engines. The method can compare the result differences between search engines, and is suitable both for regular comparative evaluation between search engines and for frequent evaluation during search engine optimization to check whether an optimized algorithm is successful.
To achieve the above object, the invention adopts the following technical solution:
A multi-search-engine automatic comparison and evaluation method that does not rely on a document library, characterized in that the method comprises the following steps:
A. selecting evaluation words;
B. crawling search results and saving them as documents;
C. extracting the body text of the documents;
D. calculating relevance;
E. merging the documents and sorting them by relevance;
F. calculating DCG;
G. ranking by the DCG results and summarizing the evaluation results.
Preferably, the evaluation words comprise: page search keywords for web search, and film titles or actor names for video search.
Preferably, the crawling comprises two crawl passes;
The first crawl comprises: generating the search-result link of each search engine from the keyword, crawling it, extracting from each search engine, by means of a template, the relevant information of each result and the link to each result's detail page, and saving them; the template is a regular expression containing the search condition;
The second crawl comprises: crawling the corresponding pages according to the detail-page links obtained in the first crawl, and saving each of them, in order, as a document.
Preferably, the body-text extraction method comprises: an HTML extraction method based on a DOM tree, or a longest-text-string extraction method;
The DOM-tree-based HTML extraction method comprises: converting the HTML text into a DOM tree, then analyzing the nodes of the DOM tree to extract the content related to the body text and remove irrelevant information from the page; the irrelevant information comprises page noise and HTML tags;
The longest-text-string extraction method comprises: finding the longest text string in the HTML page content, expanding forwards and backwards from it until a threshold is reached, then truncating and extracting to obtain the body text.
Preferably, the relevance calculation method comprises a word-frequency ratio method, expressed as: relevance = proportion of the word frequency in the document × proportion of the word frequency in all crawled results.
Preferably, the sorting by relevance comprises: dividing the documents evenly into several grades and assigning a corresponding relevance score to each grade.
Preferably, the DCG is calculated according to the following formula:
$$ \mathrm{DCG}_s = \mathrm{rel}_1 + \sum_{i=2}^{s} \frac{\mathrm{rel}_i}{\log_2 i} $$
In the formula, s is the total number of documents, i is the ordinal position (rank) of a document, and rel_i is the relevance score of the grade in which that document falls.
Preferably, the results calculated in step F are sorted and analyzed to produce multiple outputs and generate a report; the outputs comprise: the ranking by average DCG score of the results calculated in step F, the ranking by total DCG score, and the ranking by number of better and worse search results over all keywords.
Compared with the prior art, the beneficial effects of the invention are:
1) automation: no human participation is needed, saving a large amount of manpower;
2) speed: evaluation results can be obtained in a short time;
3) flexibility: many stages of the process are configurable, and the relevance calculation, among others, can be adjusted, so the method is highly flexible;
4) the whole method can be applied to various vertical searches, not merely web search; it can also be used for news search, video search and the like.
Brief description of the drawings
Fig. 1 is a flow chart of the evaluation process of the invention.
Detailed description
The invention is described in further detail below with reference to the accompanying drawing.
Analysis of the various search engines and surveys of how users use them show that users' attention to a search engine falls mainly on two aspects: accuracy and ranking. Accuracy ensures that the content shown in the search results is what the user wants; ranking places the results closest to the user's needs first, so that the user can find the desired content directly without scrolling or turning pages. The invention therefore takes these two aspects as its starting point for evaluating the results of each search engine.
The specific steps are as follows:
1) Select evaluation words
The quality of the chosen evaluation words directly determines how well the evaluation result matches actual performance. So that the evaluation covers a larger share of searches, the invention by default selects 3000 high-frequency search words as the evaluation sample; these words can be extracted from user search rankings. The range and number of words can be changed according to actual conditions: for web search, page search keywords are chosen; for video search, frequently searched film titles, actors and the like are chosen.
2) Crawl the search results of each search engine
Studies of user behavior show that most users only look at the first 2 pages of search results, roughly 40 items, so by default the invention crawls and analyzes the first 40 entries of the search results (the number can be configured as required). Most results returned by a search engine contain only a link to the source address and a summary. Because these results are incomplete, the invention performs a second crawl, fetching the complete result page from the source address for use in calculating the relevance between that page and the search word.
The two crawls proceed as follows. First, the search-result link of each search engine is generated from the keyword and crawled; a regular-expression template then extracts from each search engine the relevant information of each result and the link to each result's detail page, and these are saved. The links are used for the second crawl.
The second crawl takes the detail-page links from the results of the first crawl, crawls the corresponding pages, and saves them in order for use in step 3.
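The two crawl passes can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the URL template, the regular-expression template and the `fetch` callback are all hypothetical, and a real search engine would need its own template.

```python
import re
from urllib.parse import quote

# Hypothetical URL template and result-link pattern; every real engine differs.
SEARCH_URL_TEMPLATE = "https://search.example.com/s?q={query}&n={count}"
RESULT_LINK_PATTERN = re.compile(
    r'<a class="result" href="([^"]+)">(.*?)</a>', re.S
)


def build_search_url(keyword, count=40):
    """First crawl, part 1: generate the search-result link for a keyword."""
    return SEARCH_URL_TEMPLATE.format(query=quote(keyword), count=count)


def extract_result_links(result_page_html, limit=40):
    """First crawl, part 2: apply the regex template to a result page.

    Returns (detail_url, title) pairs in the order the engine ranked them.
    """
    matches = RESULT_LINK_PATTERN.findall(result_page_html)
    return [(url, title.strip()) for url, title in matches[:limit]]


def second_crawl(links, fetch):
    """Second crawl: fetch each detail page in rank order.

    `fetch` is a caller-supplied function (URL -> page text), so this
    sketch does not depend on any particular HTTP library.
    """
    return [fetch(url) for url, _title in links]
```

Keeping the rank order of the links matters, because the later DCG step uses each result's position within its own search engine.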
3) Extract the body text
Result pages crawled from the source address mostly contain noise such as advertisements. Before calculating the relevance of a result, the body content of the result page is therefore extracted so that this noise does not affect the calculation.
For body-text extraction, common methods such as the DOM-tree-based HTML extraction method or the longest-text-string extraction method can be used to obtain the body text of the result page, from which the relevance between the article and the search keyword is calculated.
The DOM-tree-based HTML extraction method first converts the HTML text into a DOM tree, then analyzes the nodes of the DOM tree to extract the content related to the body text and remove irrelevant information such as page noise and HTML tags. The key point of this method is how to repair the DOM tree correctly when it is incomplete.
The longest-text-string extraction method suits pages whose body is a long article. It first finds the longest text string in the HTML content, then expands forwards and backwards from it until a threshold is reached, and finally truncates and extracts to obtain the body text.
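A rough sketch of the longest-text-string idea is shown below. The source does not say what the expansion threshold measures, so this sketch assumes it is a minimum length that a neighboring text segment must reach to count as body text; the function name and the default value are illustrative only.

```python
import re

# Script and style blocks hold no visible text, so drop them first.
SCRIPT_OR_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.S | re.I)
TAG = re.compile(r"<[^>]+>")


def extract_body(html, min_neighbor_len=20):
    """Return the longest text run plus adjacent runs that look like body text.

    Assumption: expansion stops at the first neighbor shorter than
    `min_neighbor_len` characters (for example a menu item or an ad label).
    """
    cleaned = SCRIPT_OR_STYLE.sub(" ", html)
    # Splitting on tags leaves runs of visible text between them.
    segments = [s.strip() for s in TAG.split(cleaned)]
    segments = [s for s in segments if s]
    if not segments:
        return ""

    # Start from the longest segment and grow in both directions.
    center = max(range(len(segments)), key=lambda k: len(segments[k]))
    start = end = center
    while start > 0 and len(segments[start - 1]) >= min_neighbor_len:
        start -= 1
    while end < len(segments) - 1 and len(segments[end + 1]) >= min_neighbor_len:
        end += 1
    return "\n".join(segments[start:end + 1])
```

Short navigation links and advertisement labels tend to fall below the threshold, so they end the expansion rather than leaking into the extracted text.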
4) Calculate relevance
The relevance calculation is the key link in the process of the invention; steps 2 and 3 both prepare for it. Only if the relevance between each search result and the search keyword is calculated correctly can the correctness of the final evaluation result be guaranteed.
The choice of relevance rule can also vary with the type of vertical search: web search puts more weight on content match; news search must consider both content match and time; video search pays more attention to the title, annotations and the like.
In the invention, the relevance algorithm can be adjusted flexibly. A small set of manual evaluation results can be used as samples and the weights required by the relevance calculation adjusted dynamically by machine learning; alternatively, an established relevance algorithm can be adopted directly.
For example, in the news search test, the word-frequency ratio method was used to calculate the relevance of plain text. The algorithm is: relevance = proportion of the word frequency in the document × proportion of the word frequency in all crawled results, that is:
$$ P(T) = \sqrt[3]{P(W)} \cdot P(D) $$
where
$$ P(W) = \sum_{i=0}^{n} \frac{N(i) \cdot L(i)}{L(T)} $$
The cube root of P(W) is taken to balance its weight against P(D).
In the formula, n is the number of words obtained after word segmentation, N(i) is the number of occurrences of word i, L(i) is the length of word i, and L(T) is the length of the body text;
$$ P(D) = \sum_{i=0}^{n} \frac{N(i)}{T(i)} $$
In the formula, T(i) is the number of occurrences of word i in all search results of all search engines.
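The word-frequency ratio formula can be sketched as below. This is a sketch under stated assumptions: the query is taken to be already segmented into words, occurrences are counted with plain substring matching, and `all_texts` is assumed to hold the body text of every crawled result from every engine, including the document being scored. None of these details is fixed by the source.

```python
def word_frequency_relevance(query_words, doc_text, all_texts):
    """Compute P(T) = cbrt(P(W)) * P(D) for one document.

    query_words: segmented query words (word i in the formulas)
    doc_text:    extracted body text of this document, length L(T)
    all_texts:   body texts of all crawled results (assumed to include doc_text)
    """
    if not doc_text:
        return 0.0

    p_w = 0.0  # share of the document covered by query words
    p_d = 0.0  # share of all occurrences that fall in this document
    for word in query_words:
        n_i = doc_text.count(word)                          # N(i)
        t_i = sum(text.count(word) for text in all_texts)   # T(i)
        p_w += n_i * len(word) / len(doc_text)              # N(i) * L(i) / L(T)
        if t_i:
            p_d += n_i / t_i                                # N(i) / T(i)

    # Cube root of P(W) balances its weight against P(D).
    return p_w ** (1.0 / 3.0) * p_d
```

For a query word "ab", a document "ab cd ab" and a pool of that document plus "ab xx", P(W) = 2·2/8 = 0.5 and P(D) = 2/3, so the score is about 0.529.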
The time relevance uses a reciprocal curve:
$$ P(M) = \frac{W}{1 + T(n) - T(t)} $$
In the formula, T(n) is the current time, T(t) is the publication time, and the numerator W is a weight used to balance P(M) against P(T).
The final relevance is the harmonic mean of the two:
$$ P = \frac{2}{\dfrac{1}{P(M)} + \dfrac{1}{P(T)}} $$
This raises the weight of the lower of the two relevance values, so that the result is closer to actual conditions.
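A minimal sketch of the time relevance and the harmonic-mean combination follows. The source does not state the time unit, so times are assumed here to be plain numbers in a common unit such as days; clamping negative ages to zero and returning 0 for non-positive inputs are also assumptions of this sketch.

```python
def time_relevance(now, published, weight=1.0):
    """P(M) = W / (1 + T(n) - T(t)), with the age clamped at zero."""
    age = max(now - published, 0.0)
    return weight / (1.0 + age)


def combined_relevance(p_m, p_t):
    """Harmonic mean of time relevance P(M) and text relevance P(T)."""
    if p_m <= 0 or p_t <= 0:
        return 0.0  # the harmonic mean is undefined at zero; treat as irrelevant
    return 2.0 / (1.0 / p_m + 1.0 / p_t)
```

With P(M) = 0.25 and P(T) = 0.75 the harmonic mean is 0.375, below the arithmetic mean of 0.5, which shows how the lower value dominates the combined score.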
5) Merge and sort by relevance
Step 4 calculates the relevance of every result document. Here, all result documents returned by all search engines for a single search keyword are merged and sorted by relevance. The results are then divided evenly into three classes, good, medium and poor (the number of classes can be changed as required to suit automated operation), and each class is given a corresponding relevance score (for example, scores of 3 to 1 for three classes; with N classes the scores run from N to 1). These scores are supplied to the DCG formula to calculate the final DCG score.
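The merge-and-grade step might look like the sketch below. It assumes the pooled results arrive as (document id, relevance) pairs and that grade scores run from `num_grades` down to 1; both the data shape and the function name are illustrative.

```python
def grade_documents(scored_docs, num_grades=3):
    """Pool all engines' results for one keyword and split them into tiers.

    scored_docs: list of (doc_id, relevance) pairs from every engine.
    Returns {doc_id: grade score}, where the best tier scores num_grades
    and the worst tier scores 1.
    """
    ranked = sorted(scored_docs, key=lambda d: d[1], reverse=True)
    total = len(ranked)
    grades = {}
    for position, (doc_id, _relevance) in enumerate(ranked):
        tier = position * num_grades // total  # 0 is the best tier
        grades[doc_id] = num_grades - tier
    return grades
```

With six documents and three grades, the top two receive 3, the middle two receive 2 and the last two receive 1.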
6) Calculate DCG
DCG is a method for evaluating ranking quality: when highly relevant documents are placed near the top of the result page, the score is high; conversely, when documents of low relevance are placed near the top, the score is low. The DCG of s documents is calculated as:
$$ \mathrm{DCG}_s = \mathrm{rel}_1 + \sum_{i=2}^{s} \frac{\mathrm{rel}_i}{\log_2 i} $$
Step 5 sorts the search results for a single search keyword and assigns each document a relevance score, which is rel_i in the formula. All results for that keyword are then grouped by search engine. Within each group, the formula is applied using the rank i of each result within its own search engine, giving the total DCG score of the keyword on that engine; repeating this for all groups yields the keyword's DCG score on every search engine.
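The per-engine DCG calculation can be sketched as follows. The dictionary shapes (`engine_rankings`, `grades`) and the choice to score an ungraded document as 0 are assumptions of this sketch.

```python
import math


def dcg(rels):
    """DCG_s = rel_1 + sum over i = 2..s of rel_i / log2(i)."""
    total = 0.0
    for i, rel in enumerate(rels, start=1):
        total += rel if i == 1 else rel / math.log2(i)
    return total


def dcg_by_engine(engine_rankings, grades):
    """Score one keyword on every engine.

    engine_rankings: {engine: [doc_id, ...]} in that engine's own rank order
    grades:          {doc_id: grade score} from the pooled grading step
    """
    return {
        engine: dcg([grades.get(doc_id, 0) for doc_id in ranking])
        for engine, ranking in engine_rankings.items()
    }
```

Note that the position i is the rank inside each engine's own list, while the grade comes from the pooled ordering across all engines.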
The following cases arise in the DCG calculation:
1. The results of search engine A are generally better than those of search engine B, but A's ordering is not as good as B's. Because the rel values of A are generally higher than those of B, the DCG of A is higher than that of B, which is logical.
2. The relevance of A's results is similar to that of B's, but A's ordering is better. B's high-scoring rel values, being ranked lower, are pulled down by the position discount 1/log_2 i, so B's overall DCG is lower than A's, which is logical.
3. The results of search engine A are better than those of search engine B and are also better ordered, so the DCG of A is certainly higher than that of B, which is logical.
These three cases show that, in implementing the invention, the DCG result can serve as a standard for judging the quality of search-engine results.
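As a worked illustration of the second case, suppose (hypothetically) that both engines return the same three documents with grade scores 3, 2 and 1, that A lists them best-first, and that B lists them worst-first:

```latex
\begin{align*}
\mathrm{DCG}_A &= 3 + \frac{2}{\log_2 2} + \frac{1}{\log_2 3}
               \approx 3 + 2 + 0.631 = 5.631 \\
\mathrm{DCG}_B &= 1 + \frac{2}{\log_2 2} + \frac{3}{\log_2 3}
               \approx 1 + 2 + 1.893 = 4.893
\end{align*}
```

The documents are identical, so the gap of about 0.74 comes entirely from ordering: B's best document sits at position 3, where it is divided by log_2 3 ≈ 1.585.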
7) Rank by DCG result and summarize the evaluation
The results obtained in step 6 are sorted and analyzed in detail to produce multiple outputs, such as the ranking by average DCG score over all results, the ranking by total DCG score, and the ranking by number of better and worse search results over all keywords. A report is generated so that the comparison can be viewed intuitively.
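A summary step along these lines could be sketched as below. It assumes every engine has a DCG score for every keyword, and it interprets the "number of better and worse results" output as a count of keywords on which each engine scored highest; that interpretation, like the names used, is an assumption.

```python
def summarize(dcg_table):
    """Aggregate per-keyword DCG scores into per-engine summaries.

    dcg_table: {keyword: {engine: dcg score}}
    """
    totals, wins = {}, {}
    for scores in dcg_table.values():
        best = max(scores.values())
        for engine, score in scores.items():
            totals[engine] = totals.get(engine, 0.0) + score
            wins.setdefault(engine, 0)
            if score == best:
                wins[engine] += 1  # ties count as a win for each tied engine

    keyword_count = len(dcg_table)
    averages = {e: t / keyword_count for e, t in totals.items()}
    ranking = sorted(totals, key=totals.get, reverse=True)
    return {"total": totals, "average": averages, "wins": wins, "ranking": ranking}
```

The returned dictionary can be written out as a table to serve as the report.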
With the method of the invention, evaluation results can be obtained quickly and conveniently, entirely avoiding the time and manpower that manual evaluation consumes. In a test on news search, a type of vertical search, 3000 hot news words were chosen and four search engines were evaluated: Baidu, Sogou, Zhongsou and Yahoo (Google was not included as an evaluation target because of problems such as frequent blocking). 40 search results were taken from each search engine, and the evaluation took approximately 2 hours (the bottleneck being web crawling). Comparing the results with those of manual evaluation showed that the difference between the evaluation result of the invention and that of manual evaluation was within 5%.
Finally, it should be noted that the above embodiment is intended only to illustrate the technical solution of the invention and not to limit it. Although the invention has been described in detail with reference to the above embodiment, those of ordinary skill in the art should understand that the specific embodiments of the invention may still be modified or equivalently replaced, and that any modification or equivalent replacement that does not depart from the spirit and scope of the invention shall fall within the scope of the claims of the invention.

Claims (8)

1. A multi-search-engine automatic comparison and evaluation method that does not rely on a document library, characterized in that the method comprises the following steps:
A. selecting evaluation words;
B. crawling search results and saving them as documents;
C. extracting the body text of the documents;
D. calculating relevance;
E. merging the documents and sorting them by relevance;
F. calculating DCG;
G. ranking by the DCG results and summarizing the evaluation results.
2. The evaluation method as claimed in claim 1, characterized in that the evaluation words comprise: page search keywords for web search, and film titles or actor names for video search.
3. The evaluation method as claimed in claim 1, characterized in that the crawling comprises two crawl passes;
The first crawl comprises: generating the search-result link of each search engine from the keyword, crawling it, extracting from each search engine, by means of a template, the relevant information of each result and the link to each result's detail page, and saving them; the template is a regular expression containing the search condition;
The second crawl comprises: crawling the corresponding pages according to the detail-page links obtained in the first crawl, and saving each of them, in order, as a document.
4. The evaluation method as claimed in claim 1, characterized in that the body-text extraction method comprises: an HTML extraction method based on a DOM tree, or a longest-text-string extraction method;
The DOM-tree-based HTML extraction method comprises: converting the HTML text into a DOM tree, then analyzing the nodes of the DOM tree to extract the content related to the body text and remove irrelevant information from the page; the irrelevant information comprises page noise and HTML tags;
The longest-text-string extraction method comprises: finding the longest text string in the HTML page content, expanding forwards and backwards from it until a threshold is reached, then truncating and extracting to obtain the body text.
5. The evaluation method as claimed in claim 1, characterized in that the relevance calculation method comprises a word-frequency ratio method, expressed as: relevance = proportion of the word frequency in the document × proportion of the word frequency in all crawled results.
6. The evaluation method as claimed in claim 1, characterized in that the sorting by relevance comprises: dividing the documents evenly into several grades and assigning a corresponding relevance score to each grade.
7. The evaluation method as claimed in claim 1, characterized in that the DCG is calculated according to the following formula:
$$ \mathrm{DCG}_s = \mathrm{rel}_1 + \sum_{i=2}^{s} \frac{\mathrm{rel}_i}{\log_2 i} $$
In the formula, s is the total number of documents, i is the ordinal position (rank) of a document, and rel_i is the relevance score of the grade in which that document falls.
8. The evaluation method as claimed in claim 1, characterized in that the results calculated in step F are sorted and analyzed to produce multiple outputs and generate a report; the outputs comprise: the ranking by average DCG score of the results calculated in step F, the ranking by total DCG score, and the ranking by number of better and worse search results over all keywords.
CN201310538069.5A 2013-11-04 2013-11-04 A kind of multiple search engine automation contrast evaluating method independent of document library Expired - Fee Related CN103544307B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310538069.5A CN103544307B (en) 2013-11-04 2013-11-04 A kind of multiple search engine automation contrast evaluating method independent of document library

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310538069.5A CN103544307B (en) 2013-11-04 2013-11-04 A kind of multiple search engine automation contrast evaluating method independent of document library

Publications (2)

Publication Number Publication Date
CN103544307A true CN103544307A (en) 2014-01-29
CN103544307B CN103544307B (en) 2017-08-08

Family

ID=49967759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310538069.5A Expired - Fee Related CN103544307B (en) 2013-11-04 2013-11-04 A kind of multiple search engine automation contrast evaluating method independent of document library

Country Status (1)

Country Link
CN (1) CN103544307B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104699830A (en) * 2015-03-30 2015-06-10 北京奇虎科技有限公司 Method and device for evaluating search engine ordering algorithm effectiveness
CN104699825A (en) * 2015-03-30 2015-06-10 北京奇虎科技有限公司 Method and device for measuring performance of search engines
CN105808601A (en) * 2014-12-31 2016-07-27 北京奇虎科技有限公司 Calculation method and device for evaluating recording loss of search engine
CN106227762A (en) * 2016-07-15 2016-12-14 苏群 A kind of method for vertical search assisted based on user and system
CN106776299A (en) * 2016-11-30 2017-05-31 努比亚技术有限公司 Search engine test device and method
CN107704467A (en) * 2016-08-09 2018-02-16 百度在线网络技术(北京)有限公司 Search quality appraisal procedure and device
WO2018187949A1 (en) * 2017-04-12 2018-10-18 邹霞 Perspective analysis method for machine learning model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079033A (en) * 2006-06-30 2007-11-28 腾讯科技(深圳)有限公司 Integrative searching result sequencing system and method
CN101089856A (en) * 2007-07-20 2007-12-19 李沫南 Method for abstracting network data and web reptile system
US20090157652A1 (en) * 2007-12-18 2009-06-18 Luciano Barbosa Method and system for quantifying the quality of search results based on cohesion

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808601A (en) * 2014-12-31 2016-07-27 北京奇虎科技有限公司 Calculation method and device for evaluating recording loss of search engine
CN105808601B (en) * 2014-12-31 2019-07-23 北京奇虎科技有限公司 Assessment search engine resource includes the calculation method and device of loss
CN104699830A (en) * 2015-03-30 2015-06-10 北京奇虎科技有限公司 Method and device for evaluating search engine ordering algorithm effectiveness
CN104699825A (en) * 2015-03-30 2015-06-10 北京奇虎科技有限公司 Method and device for measuring performance of search engines
CN106227762A (en) * 2016-07-15 2016-12-14 苏群 A kind of method for vertical search assisted based on user and system
CN106227762B (en) * 2016-07-15 2019-06-28 苏群 A kind of method for vertical search and system based on user's assistance
CN107704467A (en) * 2016-08-09 2018-02-16 百度在线网络技术(北京)有限公司 Search quality appraisal procedure and device
CN106776299A (en) * 2016-11-30 2017-05-31 努比亚技术有限公司 Search engine test device and method
WO2018187949A1 (en) * 2017-04-12 2018-10-18 邹霞 Perspective analysis method for machine learning model

Also Published As

Publication number Publication date
CN103544307B (en) 2017-08-08

Similar Documents

Publication Publication Date Title
CN106649818B (en) Application search intention identification method and device, application search method and server
CN103544307A (en) Multi-search-engine automatic comparison and evaluation method independent of document library
CN102760138B (en) Classification method and device for user network behaviors and search method and device for user network behaviors
CN102982153B (en) A kind of information retrieval method and device thereof
CN102063469B (en) Method and device for acquiring relevant keyword message and computer equipment
CN103186574B (en) A kind of generation method and apparatus of Search Results
CN105095187A (en) Search intention identification method and device
CN101299217B (en) Method, apparatus and system for processing map information
JP5543020B2 (en) Research mission identification
CN102567494B (en) Website classification method and device
CN105426514A (en) Personalized mobile APP recommendation method
CN104008109A (en) User interest based Web information push service system
CN108334610A (en) A kind of newsletter archive sorting technique, device and server
CN102243661B (en) Website content quality assessment method and device
CN103902597A (en) Method and device for determining search relevant categories corresponding to target keywords
CN103577416A (en) Query expansion method and system
CN106021418B (en) The clustering method and device of media event
CN105653701A (en) Model generating method and device as well as word weighting method and device
Costa et al. Learning temporal-dependent ranking models
CN103020212A (en) Method and device for finding hot videos based on user query logs in real time
CN103116635B (en) Field-oriented method and system for collecting invisible web resources
CN105653562A (en) Calculation method and apparatus for correlation between text content and query request
CN109145301B (en) Information classification method and device and computer readable storage medium
CN101350011A (en) Method for detecting search engine cheat based on small sample set
CN112307336B (en) Hot spot information mining and previewing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20170427

Address after: 100086 Beijing, Haidian District, North Third Ring Road West, No. 43, building 5, floor 08-09, No. 2

Applicant after: BEIJING ZHONGSOU CLOUD BUSINESS NETWORK TECHNOLOGY Co.,Ltd.

Address before: Shou Heng Technology Building No. 51 Beijing 100191 Haidian District Xueyuan Road room 0902

Applicant before: BEIJING ZHONGSOU NETWORK TECHNOLOGY Co.,Ltd.

GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170808

Termination date: 20211104
