CN103064984B - Method and system for identifying spam web pages - Google Patents

Method and system for identifying spam web pages

Info

Publication number
CN103064984B
Authority
CN
China
Prior art keywords
inquiry
results
web page
page
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310029963.XA
Other languages
Chinese (zh)
Other versions
CN103064984A (en
Inventor
刘奕群
马少平
张敏
金奕江
张阔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Beijing Sogou Technology Development Co Ltd
Original Assignee
Tsinghua University
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University, Beijing Sogou Technology Development Co Ltd filed Critical Tsinghua University
Priority to CN201310029963.XA priority Critical patent/CN103064984B/en
Publication of CN103064984A publication Critical patent/CN103064984A/en
Application granted granted Critical
Publication of CN103064984B publication Critical patent/CN103064984B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The present invention provides a method and system for identifying spam web pages. The method comprises: obtaining the query log of a search engine and preprocessing it to obtain a preprocessed query log; filtering, from the queries and result pages in the preprocessed query log, a query-result set in which the user click rate of each query and the occurrence count of each result page exceed thresholds; manually screening the query-result set to extract multiple spam pages and generate a spam page sample set; calculating, from the query-result set and the spam page sample set, a spam score for each result page and a cheating score for each query in the query-result set; and, when the spam score of a result page exceeds a threshold, identifying that result page as a spam page and adding it to the spam page set. By discovering and identifying spam pages from search engine query logs, the method according to embodiments of the invention reduces algorithmic complexity and offers good generality and adaptability.

Description

Method and system for identifying spam web pages
Technical field
The present invention relates to the technical field of intelligent processing of network information, and in particular to a method and system for identifying spam web pages.
Background
The rapid growth in the volume of information on the Internet has made search engines an indispensable means of obtaining information in people's daily work and life. According to statistics published by CNNIC in December 2011, the number of search engine users among Chinese Internet users had reached 396 million, an adoption rate of nearly 80%, making search one of the most widely used Internet services. Search engines serve as an important entry point when users go online; a favorable ranking in search results has therefore become the most effective way for an Internet resource to gain user attention quickly.
Under this model of information access, in which the search engine is the main entry point, the high traffic and high revenue brought by a high search ranking tempt many Internet content providers to deceive search engine algorithms in order to obtain a favorable result ranking. Pages that profit from such deception are spam pages. A spam page is defined as a page that exploits defects in a search engine's algorithms and uses deceptive techniques aimed at the search engine to obtain a ranking higher than the quality of its content deserves, in pursuit of direct or indirect benefit.
Based on a sampling analysis of English web pages, Fetterly et al. concluded in 2003 that at least 8.1% of pages were spam; a separate estimate from 2004 put spam at about 10% to 15% of Web content. According to our own sampling analysis of about 800 million Chinese web pages, carried out with the assistance of the Sogou search engine, about 15% of pages on the Chinese Internet are spam.
Spam pages have significant adverse effects on Internet users, on the Internet resource environment, and on search engines. For users, spam pages occupy top positions in result lists and trick users into clicking; this makes it harder for users to find the information they want and lowers the efficiency of information access. Spam pages are also often combined with viruses, Trojans, and the like, seriously affecting users' information security. For the Internet resource environment, legal and regulatory restrictions mean that search engines generally do not offer paid advertising services for illegal content such as pornography and gambling, so cheating to improve ranking becomes the main option for sites offering such content. Spam pages are consequently flooded with all kinds of illegal content, and illegal pages that adopt cheating techniques tend to cause wider harm and damage the Internet resource environment more seriously. For search engine systems, spam fills the index with useless pages and wastes large amounts of storage and processing time, which increases the cost of processing each query, lowers search efficiency, and reduces users' trust in the search engine.
One class of conventional spam identification methods targets content-based cheating. Such work analyzes the URL features and common-phrase features of spam pages. In one study, page content features were extracted from 105 million pages crawled by MSN Search, and features including title length, average word length, proportion of visible content, and content compression ratio were used to distinguish spam pages from normal pages. Later work built on this with more content features, such as the amount of anchor text and the number of popular terms on the page, and used a learning-based method to combine these features for spam identification.
Another class identifies spam pages through link structure analysis. The TrustRank algorithm, proposed in 2004, opened a new way of using link structure information to identify spam, and can be applied to various kinds of spam, including content cheating and link cheating. Although the method has no way of coping with noisy data in the link graph, a considerable number of researchers have proposed link analysis techniques for spam identification based on improvements to TrustRank, including Anti-TrustRank and Truncated PageRank.
These spam identification methods all achieve good results on relatively fixed test collections: many of the results reported in the internationally known Web Spam Challenge evaluation reach identification accuracy above 80%, and the experimental accuracy reported in many related papers often exceeds 90%. For various reasons, however, these algorithms still face great challenges when applied to the real Internet environment and cannot realize their full effectiveness, which is why spam pages continue to have a large impact on search engine applications.
The shortcomings of the prior art are as follows:
(1) These algorithms can often identify only one particular type of spam page and lack robustness. Cheating techniques emerge endlessly; although an algorithm may perform very well on one class of spam, it cannot identify other kinds, and once spam authors adopt a new cheating technique, these algorithms tend to lose their effectiveness.
(2) As cheating techniques develop, many algorithms need to expend large amounts of computation, storage, or bandwidth to identify spam, for example by building n-gram language models of page content, crawling pages repeatedly, or deeply parsing page scripts. The efficiency of these algorithms is therefore at odds with the online service requirements of search engines, and they cannot be applied in real search engine services.
Summary of the invention
The purpose of the present invention is to solve at least one of the above technical defects.
To this end, an embodiment of one aspect of the invention provides a method for identifying spam web pages, comprising the following steps: S1: obtaining the query log of a search engine and preprocessing the query log to obtain a preprocessed query log, wherein the preprocessed query log includes multiple queries and result pages; S2: filtering, from the multiple queries and result pages of the preprocessed query log, a query-result set in which the user click rate of the query and the occurrence count of the result page exceed thresholds; S3: manually screening the query-result set to extract multiple spam pages and generate a spam page sample set; S4: calculating, from the query-result set and the spam page sample set, the spam score of each result page and the cheating score of each query in the query-result set; and S5: if the spam score of a result page in the query-result set exceeds a threshold, identifying the result page as a spam page and adding it to the spam page set.
By discovering and identifying spam pages from search engine query log data, the method according to embodiments of the invention reduces algorithmic complexity; its structure and parameters are simple, its identification results are comprehensive and reliable, and it has good generality and adaptability.
In one example of the invention, step S1 specifically includes: S11: obtaining the query log of the search engine and converting the query log to GBK encoding; S12: cleaning up the converted query log to obtain the preprocessed query log.
In one example of the invention, step S2 specifically includes: S21: segmenting each query in the preprocessed query log into multiple keywords, and building a first query-result set from each keyword and the result pages clicked by users; S22: calculating the user click frequency of result pages for each query in the first query-result set, and selecting the queries and result pages whose user click rate exceeds a threshold to generate a second query-result set; S23: counting the number of times each result in the second query-result set occurs in that set, and selecting the queries and result pages whose occurrence count exceeds a threshold to generate the query-result set.
In one example of the invention, step S4 specifically includes: S41: setting the initial cheating score of each query in the query-result set, and setting the initial spam score of each result page in the query-result set; S42: calculating the mean spam score of all result pages associated with each query in the query-result set and taking it as the cheating score of that query; and S43: calculating the mean cheating score of all queries associated with each result page in the query-result set and, if the result page is not in the spam page sample set, taking that mean as the spam score of the page; otherwise the spam score is left unchanged.
To the same end, an embodiment of another aspect of the invention provides a system for identifying spam web pages, comprising: a preprocessing module, for obtaining the query log of a search engine and preprocessing the query log to obtain a preprocessed query log, wherein the preprocessed query log includes multiple queries and result pages; a screening module, for filtering, from the multiple queries and result pages of the preprocessed query log, a query-result set in which the user click rate of the query and the occurrence count of the result page exceed thresholds; an extraction module, for manually screening the query-result set to extract multiple spam pages and generate a spam page sample set; a computing module, for calculating, from the query-result set and the spam page sample set, the spam score of each result page and the cheating score of each query in the query-result set; a judging module, for judging whether the spam score of a result page in the query-result set exceeds a threshold, the result page being a spam page if it does; and a processing module, for adding the result page to the spam page set.
By discovering and identifying spam pages from search engine query log data, the system according to embodiments of the invention reduces algorithmic complexity; its structure and parameters are simple, its identification results are comprehensive and reliable, and it has good generality and adaptability.
In one example of the invention, the preprocessing module includes: an acquisition and conversion unit, for obtaining the query log of the search engine and converting the query log to GBK encoding; and a preprocessing unit, for cleaning up the converted query log to obtain the preprocessed query log.
In one example of the invention, the screening module includes: a construction unit, for segmenting each query in the preprocessed query log into multiple keywords and building a first query-result set from each keyword and the result pages clicked by users; a first computing unit, for calculating the user click frequency of result pages for each query in the first query-result set and selecting the queries and result pages whose user click rate exceeds a threshold to generate a second query-result set; and a second computing unit, for counting the number of times each result in the second query-result set occurs in that set and selecting the queries and result pages whose occurrence count exceeds a threshold to generate the query-result set.
In one example of the invention, the computing module includes: a setting unit, for setting the initial cheating score of each query in the query-result set and the initial spam score of each result page in the query-result set; a third computing unit, for calculating the mean spam score of all result pages associated with each query in the query-result set and taking it as the cheating score of that query; and a fourth computing unit, for calculating the mean cheating score of all queries associated with each result page in the query-result set and, if the result page is not in the spam page sample set, taking that mean as the spam score of the page, the spam score otherwise being left unchanged.
Additional aspects and advantages of the invention will be set forth in part in the following description, and in part will become apparent from the description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is a flow chart of the method for identifying spam web pages according to one embodiment of the invention;
Fig. 2 is a diagram of the structure of the preprocessed log according to one embodiment of the invention;
Fig. 3 is a schematic diagram of the spam score calculation over the query-result set according to one embodiment of the invention;
Fig. 4 is a block diagram of the system for identifying spam web pages according to another embodiment of the invention.
Detailed description
Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the drawings, in which identical or similar reference numerals denote identical or similar elements or elements with identical or similar functions throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and are not to be construed as limiting the claims.
In the description of the invention, it should be understood that the terms "first", "second", "third", and "fourth" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or as implicitly indicating the number of the technical features referred to. A feature qualified by "first", "second", "third", or "fourth" may therefore explicitly or implicitly include one or more such features. In the description of the invention, "multiple" means two or more, unless specifically limited otherwise.
Fig. 1 is a flow chart of the method for identifying spam web pages according to one embodiment of the invention. As shown in Fig. 1, the method according to the embodiment of the invention comprises the following steps:
Step S101: obtain the query log of a search engine and preprocess the query log to obtain a preprocessed query log, wherein the preprocessed query log includes multiple queries and result pages.
Specifically, the query log of the search engine is first obtained and converted to GBK encoding. The converted query log is then cleaned up to obtain the preprocessed query log, whose structure is shown in Fig. 2. Table 1 lists the content of the search engine query log after preprocessing.
Table 1
In one embodiment of the invention, the log used contains all queries submitted to the Sogou search engine over a nine-day period beginning March 1, 2011. It contains 8,443,963 distinct queries and 12,470,865 distinct clicked pages, which belong to 1,055,001 distinct websites. The information contained in the log is shown in Table 2.
Table 2
The log information in Table 2 contains enough items for automatic search engine evaluation, so this log can be used to evaluate the performance of Chinese search engines.
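The encoding normalization in step S101 could be sketched as follows. This is only an illustration under stated assumptions: the source does not give the raw log encoding or the record layout, so the UTF-8 default, the tab-separated `query<TAB>clicked_url` layout, and the function names `to_gbk` and `parse_line` are all hypothetical.

```python
def to_gbk(raw: bytes, source_encoding: str = "utf-8") -> bytes:
    """Re-encode one raw log line as GBK.

    Characters that cannot be decoded or that GBK cannot represent are
    replaced rather than raising; that policy is an assumption of this
    sketch, not something the source specifies.
    """
    text = raw.decode(source_encoding, errors="replace")
    return text.encode("gbk", errors="replace")


def parse_line(gbk_line: bytes) -> tuple[str, str]:
    """Split a GBK-encoded 'query<TAB>clicked_url' record (hypothetical layout)."""
    query, url = gbk_line.decode("gbk").rstrip("\n").split("\t", 1)
    return query.strip(), url.strip()
```

Converting every record to one encoding before parsing means the later steps compare query strings byte for byte without worrying about mixed encodings.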
Step S102: filter, from the multiple queries and result pages of the preprocessed query log, a query-result set in which the user click rate of the query and the occurrence count of the result page exceed thresholds.
Specifically, each query in the preprocessed query log is segmented into multiple keywords, and a first query-result set is built from each keyword and the result pages clicked by users. The user click frequency of result pages is then calculated for each query in the first query-result set, and the queries and result pages whose user click rate exceeds a threshold are selected to generate a second query-result set. Finally, the number of times each result in the second query-result set occurs in that set is counted, and the queries and result pages whose occurrence count exceeds a threshold are selected to generate the query-result set.
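The two-stage filtering of step S102 might look like the sketch below. Several details are assumptions, because the source leaves them open: whitespace splitting stands in for a real Chinese word segmenter, the click rate of a keyword-page pair is taken to be its share of all clicks recorded for that keyword, and the threshold values and the name `build_query_result_set` are hypothetical.

```python
from collections import Counter


def build_query_result_set(clicks, min_click_rate=0.1, min_occurrences=1):
    """clicks: iterable of (query, clicked_url) records from the preprocessed log.

    Returns {(keyword, url): click_count} for the pairs that survive both filters.
    """
    # First set: one (keyword, url) pair per keyword of each clicked query.
    pair_counts = Counter()
    for query, url in clicks:
        for keyword in query.split():  # stand-in for a real word segmenter
            pair_counts[(keyword, url)] += 1

    # Second set: keep pairs whose click rate exceeds the threshold.
    clicks_per_keyword = Counter()
    for (keyword, _url), n in pair_counts.items():
        clicks_per_keyword[keyword] += n
    second = {
        pair: n
        for pair, n in pair_counts.items()
        if n / clicks_per_keyword[pair[0]] > min_click_rate
    }

    # Final set: keep pairs whose page occurs often enough in the second set.
    url_occurrences = Counter(url for (_keyword, url) in second)
    return {
        pair: n
        for pair, n in second.items()
        if url_occurrences[pair[1]] > min_occurrences
    }
```

The surviving click counts can later serve as the association strength between a keyword and a page.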
Step S103: manually screen the query-result set to extract multiple spam pages and generate a spam page sample set.
Specifically, a number of search results, for example 1000 query-result pairs, are drawn at random from the query-result set, and each result page among them is labeled as spam or not spam. Labeling stops once the number of labeled spam pages reaches a predetermined quantity, for example 200. If the number of spam pages has not reached the predetermined quantity, another 1000 pairs are drawn from the query-result set and labeled, and so on, until the number of spam pages reaches the predetermined quantity. The labeled spam pages form the spam page sample set.
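The batch-by-batch labeling loop of step S103 can be sketched as below. The labeling itself is manual in the source; here a callback `is_spam` stands in for the human annotator, and that callback, the fixed random seed, and the name `collect_spam_samples` are assumptions made for illustration.

```python
import random


def collect_spam_samples(pairs, is_spam, target=200, batch_size=1000, seed=0):
    """Draw random batches of (query, url) pairs and label them until
    `target` spam pages are found or the pairs run out.

    is_spam: callable url -> bool, a stand-in for the human annotator.
    """
    rng = random.Random(seed)
    remaining = list(pairs)
    rng.shuffle(remaining)  # random order, so each slice is a random batch

    spam = set()
    while remaining and len(spam) < target:
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        for _query, url in batch:
            if is_spam(url):
                spam.add(url)
                if len(spam) >= target:
                    break  # stop labeling as soon as the target is reached
    return spam
```

With the example figures from the text, `target=200` and `batch_size=1000` would be used.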
Step S104: calculate, from the query-result set and the spam page sample set, the spam score of each result page and the cheating score of each query in the query-result set.
Specifically, the initial cheating score of each query in the query-result set is set to 0, and the initial spam score of each result page in the query-result set is set as follows: if the result page is in the spam page sample set, its initial spam score is set to 1; otherwise it is set to 0. Next, for each query, the mean spam score of all result pages associated with that query in the query-result set is calculated and taken as the query's cheating score. Finally, for each result page, the mean cheating score of all queries associated with that page in the query-result set is calculated; if the result page is not in the spam page sample set, this mean becomes its spam score, and otherwise its spam score is left unchanged. In embodiments of the invention, these two updates are repeated in turn, typically 20 to 30 times, and the spam score finally obtained is the spam score of the result page.
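The alternating update described for step S104 can be written compactly. The following is a minimal sketch using unweighted means; the function name `propagate_scores` and the representation of the query-result set as a list of `(query, url)` pairs are assumptions.

```python
def propagate_scores(pairs, spam_samples, iterations=20):
    """pairs: iterable of (query, url); spam_samples: set of labeled spam urls.

    Returns (spam_score, cheat_score) dictionaries after the given number
    of alternating updates.
    """
    urls_of, queries_of = {}, {}
    for query, url in pairs:
        urls_of.setdefault(query, set()).add(url)
        queries_of.setdefault(url, set()).add(query)

    # Initial scores: 1 for labeled spam pages, 0 for everything else.
    spam_score = {u: 1.0 if u in spam_samples else 0.0 for u in queries_of}
    cheat_score = {q: 0.0 for q in urls_of}

    for _ in range(iterations):
        # Query update: mean spam score of the pages linked to the query.
        for query, urls in urls_of.items():
            cheat_score[query] = sum(spam_score[u] for u in urls) / len(urls)
        # Page update: mean cheating score of the linked queries;
        # labeled samples keep their score of 1.
        for url, queries in queries_of.items():
            if url not in spam_samples:
                spam_score[url] = sum(cheat_score[q] for q in queries) / len(queries)
    return spam_score, cheat_score
```

Each iteration touches every query-result pair a constant number of times, so the cost per iteration is linear in the size of the query-result set.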
Fig. 3 is a schematic diagram of the spam score calculation over the query-result set according to one embodiment of the invention. As shown in Fig. 3, the query-result set contains the correspondence between queries and results, and the strength of association between a query and a result is recorded by the frequency with which that query-result pair occurs (denoted wii in Fig. 3). Starting from the small, manually labeled spam page sample set, the spam score of every page can be calculated by stepwise iteration. Suppose URL1 is a page in the spam page sample set (its spam score is 1) and URL2 is not (its initial spam score is 0). In the first iteration, the cheating scores of Query1 and Query3 are then the mean of the spam scores of URL1 and URL2 (either an unweighted mean or a mean weighted by strength of association). The spam score of URL2 is in turn the mean of the cheating scores of Query1 and Query3 (again either unweighted or weighted by strength of association), so that the spam score spreads from the sample set to other pages. Continuing in this way yields the spam scores of all pages.
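The weighted variant mentioned for Fig. 3 can be illustrated with the same two pages and two queries. The weights below are made-up values chosen so the arithmetic is easy to follow; they do not come from the source, and the names `weighted_mean` and `propagate_weighted` are hypothetical.

```python
def weighted_mean(values_and_weights):
    """Mean of the values, each weighted by its association strength."""
    total = sum(w for _v, w in values_and_weights)
    if total == 0:
        return 0.0
    return sum(v * w for v, w in values_and_weights) / total


def propagate_weighted(weights, spam_samples, iterations=1):
    """weights: {(query, url): occurrence frequency of that pair}."""
    spam_score = {u: 1.0 if u in spam_samples else 0.0 for _q, u in weights}
    cheat_score = {q: 0.0 for q, _u in weights}
    for _ in range(iterations):
        for q in cheat_score:
            cheat_score[q] = weighted_mean(
                [(spam_score[u], w) for (q2, u), w in weights.items() if q2 == q]
            )
        for u in spam_score:
            if u not in spam_samples:  # labeled samples stay at 1
                spam_score[u] = weighted_mean(
                    [(cheat_score[q], w) for (q, u2), w in weights.items() if u2 == u]
                )
    return spam_score, cheat_score


# Illustrative weights only: Query1 leads to URL1 three times as often as to URL2.
example = {
    ("Query1", "URL1"): 3, ("Query1", "URL2"): 1,
    ("Query3", "URL1"): 1, ("Query3", "URL2"): 1,
}
scores, cheats = propagate_weighted(example, {"URL1"}, iterations=1)
```

After one iteration with these weights, Query1 scores (1×3 + 0×1)/4 = 0.75, Query3 scores 0.5, and URL2 receives (0.75 + 0.5)/2 = 0.625, which shows how a strong association with a labeled spam page raises the score passed on to its neighbors.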
Step S105: a result page in the query-result set whose spam score exceeds a threshold is identified as a spam page and added to the spam page set.
In one embodiment of the invention, the spam score threshold used as the criterion for spam pages can be chosen according to circumstances, for example 0.8. The identified spam pages are added to the spam page set for use as spam identification data.
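The thresholding of step S105 amounts to a single filter over the final scores. A minimal sketch, assuming the scores are held in a dictionary keyed by URL; the name `select_spam_pages` is hypothetical, and the strict comparison follows the "more than threshold" wording of the text.

```python
def select_spam_pages(spam_score, threshold=0.8):
    """Return the urls whose spam score is strictly greater than the threshold."""
    return {url for url, score in spam_score.items() if score > threshold}


# Adding the newly identified pages to an existing spam page set.
spam_page_set = {"known-spam.example"}
spam_page_set |= select_spam_pages({"a.example": 0.9, "b.example": 0.8, "c.example": 0.1})
```

Raising the threshold trades recall for precision, which is presumably why the text leaves its value to be chosen according to circumstances.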
By discovering and identifying spam pages from search engine query log data, the method according to embodiments of the invention reduces algorithmic complexity; its structure and parameters are simple, its identification results are comprehensive and reliable, and it has good generality and adaptability.
Fig. 4 is a block diagram of the system for identifying spam web pages according to another embodiment of the invention. As shown in Fig. 4, the system according to the embodiment of the invention includes a preprocessing module 100, a screening module 200, an extraction module 300, a computing module 400, a judging module 500, and a processing module 600.
The preprocessing module 100 obtains the query log of a search engine and preprocesses the query log to obtain a preprocessed query log, wherein the preprocessed query log includes multiple queries and result pages.
In one embodiment of the invention, the preprocessing module 100 includes an acquisition and conversion unit 110 and a preprocessing unit 120.
The acquisition and conversion unit 110 obtains the query log of the search engine and converts the query log to GBK encoding.
The preprocessing unit 120 cleans up the converted query log to obtain the preprocessed query log.
In one embodiment of the invention, the query log of the search engine is obtained and its encoding is uniformly converted to GBK. The converted query log is cleaned up and useless information is filtered out to obtain the preprocessed query log. Fig. 2 shows the structure of the preprocessed query log.
The screening module 200 filters, from the multiple queries and result pages of the preprocessed query log, a query-result set in which the user click rate of the query and the occurrence count of the result page exceed thresholds.
In one embodiment of the invention, the screening module 200 includes a construction unit 210, a first computing unit 220, and a second computing unit 230.
The construction unit 210 segments each query in the preprocessed query log into multiple keywords and builds a first query-result set from each keyword and the result pages clicked by users.
The first computing unit 220 calculates the user click frequency of result pages for each query in the first query-result set and selects the queries and result pages whose user click rate exceeds a threshold to generate a second query-result set.
The second computing unit 230 counts the number of times each result in the second query-result set occurs in that set and selects the queries and result pages whose occurrence count exceeds a threshold to generate the query-result set.
The extraction module 300 supports manual screening of the query-result set to extract multiple spam pages and generate a spam page sample set.
In one embodiment of the invention, a number of search results, for example 1000 query-result pairs, are drawn at random from the query-result set, and each result page among them is labeled as spam or not spam. Labeling stops once the number of labeled spam pages reaches a predetermined quantity, for example 200. If the number of spam pages has not reached the predetermined quantity, another 1000 pairs are drawn from the query-result set and labeled, and so on, until the number of spam pages reaches the predetermined quantity. The labeled spam pages form the spam page sample set.
The computing module 400 calculates, from the query-result set and the spam page sample set, the spam score of each result page and the cheating score of each query in the query-result set.
In one embodiment of the invention, the computing module 400 includes a setting unit 410, a third computing unit 420, and a fourth computing unit 430.
The setting unit 410 sets the initial cheating score of each query in the query-result set and the initial spam score of each result page in the query-result set.
The third computing unit 420 calculates the mean spam score of all result pages associated with each query in the query-result set and takes it as the cheating score of that query.
The fourth computing unit 430 calculates the mean cheating score of all queries associated with each result page in the query-result set; if the result page is not in the spam page sample set, this mean is taken as the spam score of the page, and otherwise the spam score is left unchanged.
In embodiments of the invention, the third computing unit and the fourth computing unit update the cheating scores and spam scores in turn, typically 20 to 30 times, and the spam score finally obtained is the spam score of the result page.
Fig. 3 is a schematic diagram of the spam score calculation over the query-result set according to one embodiment of the invention. As shown in Fig. 3, the query-result set contains the correspondence between queries and results, and the strength of association between a query and a result is recorded by the frequency with which that query-result pair occurs (denoted wii in Fig. 3). Starting from the small, manually labeled spam page sample set, the spam score of every page can be calculated by stepwise iteration. Suppose URL1 is a page in the spam page sample set (its spam score is 1) and URL2 is not (its initial spam score is 0). In the first iteration, the cheating scores of Query1 and Query3 are then the mean of the spam scores of URL1 and URL2 (either an unweighted mean or a mean weighted by strength of association). The spam score of URL2 is in turn the mean of the cheating scores of Query1 and Query3 (again either unweighted or weighted by strength of association), so that the spam score spreads from the sample set to other pages. Continuing in this way yields the spam scores of all pages.
The judging module 500 is configured to judge whether the spam score of a result web page in the query-result set is greater than a threshold; if it is, the result web page is a spam page. In one embodiment of the present invention, the spam score threshold used as the criterion can be set according to circumstances, for example to 0.8.
The processing module 600 is configured to add the result web page to the spam page set. Identified spam pages are added to the spam page set and used as data for identifying spam pages.
The system according to embodiments of the present invention discovers and identifies spam pages from search engine query log data, which reduces algorithm complexity. Its structure and parameters are simple, its recognition results are comprehensive and reliable, and it generalizes and adapts well.
Although embodiments of the present invention have been shown and described above, it should be understood that these embodiments are exemplary and are not to be construed as limiting the invention. Those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the invention without departing from its principle and purpose.
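The judging and processing steps amount to a threshold comparison followed by a set update. The sketch below is illustrative only: the function name `update_spam_set` and the dictionary of scores are hypothetical, and 0.8 is simply the example threshold given in the text.

```python
def update_spam_set(spam_scores, spam_set, threshold=0.8):
    """Flag pages whose spam score exceeds the threshold and add them to spam_set.

    spam_scores: dict mapping url -> spam score (assumed layout).
    spam_set: set of known spam pages, updated in place.
    Returns the set of newly flagged pages.
    """
    # Judging module: a page is spam only if its score is strictly greater than the threshold.
    flagged = {url for url, score in spam_scores.items() if score > threshold}
    # Processing module: add the identified pages to the spam page set.
    spam_set |= flagged
    return flagged
```

Because the comparison is strict, a page scoring exactly 0.8 is not flagged under this reading of "greater than the threshold".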

Claims (4)

1. A method for identifying spam pages, characterized by comprising the following steps:
S1: obtaining a query log of a search engine and preprocessing the query log to obtain a preprocessed query log, wherein the preprocessed query log includes multiple queries and result web pages;
S2: screening out, from the multiple queries and result web pages of the preprocessed query log, a query-result set in which the user click rate of the queries and the number of occurrences of the result web pages are greater than thresholds;
S3: manually screening and extracting multiple spam pages from the query-result set to generate a spam page sample set;
S4: jointly calculating, according to the query-result set and the spam page sample set, the spam score of each result web page and the cheating score of each query in the query-result set; and
S5: if the spam score of a result web page in the query-result set is greater than a threshold, determining that the result web page is a spam page, and adding the result web page to the spam page set,
wherein step S2 specifically includes:
S21: segmenting each query of the preprocessed query log into multiple keywords, and building a first query-result set from each of the multiple keywords and the result web pages clicked by users;
S22: calculating the user click frequency of the result web pages for each query in the first query-result set, and screening out the queries and result web pages whose user click rate is greater than a threshold to generate a second query-result set;
S23: calculating the number of times each result in the second query-result set occurs in the second query-result set, and screening out the queries and result web pages whose number of occurrences is greater than a threshold to generate the query-result set,
and step S4 specifically includes:
S41: setting an initial cheating score for each query in the query-result set, and setting an initial spam score for the result web pages in the query-result set;
S42: calculating the average of the spam scores of all result web pages associated with each query in the query-result set as the cheating score of the corresponding query; and
S43: calculating the average of the cheating scores of all queries associated with each result web page in the query-result set, and, if the result web page is not among the spam pages, taking the average of the cheating scores as the spam score of the corresponding web page, otherwise leaving the spam score unchanged.
2. The method for identifying spam pages according to claim 1, characterized in that step S1 specifically includes:
S11: obtaining the query log of the search engine, and converting the query log to GBK format;
S12: organizing the converted query log to obtain the preprocessed query log.
3. A system for identifying spam pages, characterized by comprising:
a preprocessing module, configured to obtain a query log of a search engine and preprocess the query log to obtain a preprocessed query log, wherein the preprocessed query log includes multiple queries and result web pages;
a screening module, configured to screen out, from the multiple queries and result web pages of the preprocessed query log, a query-result set in which the user click rate of the queries and the number of occurrences of the result web pages are greater than thresholds;
an extraction module, configured to extract, by manual screening, multiple spam pages from the query-result set to generate a spam page sample set;
a computing module, configured to jointly calculate, according to the query-result set and the spam page sample set, the spam score of each result web page and the cheating score of each query in the query-result set;
a judging module, configured to judge whether the spam score of a result web page in the query-result set is greater than a threshold, the result web page being a spam page if it is greater than the threshold; and
a processing module, configured to add the result web page to the spam page set,
wherein the screening module includes:
a construction unit, configured to segment each query of the preprocessed query log into multiple keywords, and to build a first query-result set from each of the multiple keywords and the result web pages clicked by users;
a first computing unit, configured to calculate the user click frequency of the result web pages for each query in the first query-result set, and to screen out the queries and result web pages whose user click rate is greater than a threshold to generate a second query-result set;
a second computing unit, configured to calculate the number of times each result in the second query-result set occurs in the second query-result set, and to screen out the queries and result web pages whose number of occurrences is greater than a threshold to generate the query-result set,
and the computing module includes:
a setting unit, configured to set an initial cheating score for each query in the query-result set, and to set an initial spam score for the result web pages in the query-result set;
a third computing unit, configured to calculate the average of the spam scores of all result web pages associated with each query in the query-result set as the cheating score of the corresponding query; and
a fourth computing unit, configured to calculate the average of the cheating scores of all queries associated with each result web page in the query-result set, and, if the result web page is not among the spam pages, to take the average of the cheating scores as the spam score of the corresponding web page, otherwise leaving the spam score unchanged.
4. The system for identifying spam pages according to claim 3, characterized in that the preprocessing module includes:
an obtaining and converting unit, configured to obtain the query log of the search engine and convert the query log to GBK format;
a preprocessing unit, configured to organize the converted query log to obtain the preprocessed query log.
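The screening in steps S21 to S23 of claim 1 (and the corresponding units of claim 3) can be illustrated with a small sketch. Several details here are assumptions rather than anything the claims specify: the function name `screen_query_results`, the `(query, clicked_url)` record layout, whitespace splitting as a stand-in for a real word segmenter, click rate read as a page's share of all clicks for a keyword, and occurrence count read as the number of keywords a page appears with in the second set.

```python
from collections import Counter

def screen_query_results(log, min_click_rate=0.1, min_occurrences=2, segment=str.split):
    """Build the query-result set from (query, clicked_url) records.

    All thresholds and the whitespace-based `segment` default are hypothetical.
    Returns a dict mapping (keyword, url) -> click count.
    """
    # S21: split each query into keywords and pair every keyword with the clicked page.
    first = Counter()
    for query, url in log:
        for keyword in segment(query):
            first[(keyword, url)] += 1

    # S22: keep pairs whose share of the keyword's clicks exceeds the click-rate threshold.
    clicks_per_keyword = Counter()
    for (keyword, _), count in first.items():
        clicks_per_keyword[keyword] += count
    second = {pair: count for pair, count in first.items()
              if count / clicks_per_keyword[pair[0]] > min_click_rate}

    # S23: keep pairs whose page occurs with more keywords than the occurrence threshold.
    occurrences = Counter(url for _, url in second)
    return {pair: count for pair, count in second.items()
            if occurrences[pair[1]] > min_occurrences}
```

The surviving click counts could serve as the association strengths used when the averages in step S4 are weighted rather than equally weighted.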
CN201310029963.XA 2013-01-25 2013-01-25 The recognition methods of spam page and system Active CN103064984B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310029963.XA CN103064984B (en) 2013-01-25 2013-01-25 The recognition methods of spam page and system

Publications (2)

Publication Number Publication Date
CN103064984A CN103064984A (en) 2013-04-24
CN103064984B true CN103064984B (en) 2016-08-10

Family

ID=48107614

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310029963.XA Active CN103064984B (en) 2013-01-25 2013-01-25 The recognition methods of spam page and system

Country Status (1)

Country Link
CN (1) CN103064984B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598460B (en) * 2013-10-30 2018-11-02 腾讯科技(深圳)有限公司 The recognition methods of rubbish Anchor Text and device
CN103595732B (en) * 2013-11-29 2017-09-15 北京奇虎科技有限公司 A kind of method and device of network attack evidence obtaining
CN104933055B (en) * 2014-03-18 2020-01-31 腾讯科技(深圳)有限公司 Webpage identification method and webpage identification device
CN106844371B (en) * 2015-12-03 2020-09-08 阿里巴巴集团控股有限公司 Search optimization method and device
CN106844685B (en) * 2017-01-26 2020-07-28 百度在线网络技术(北京)有限公司 Method, device and server for identifying website
CN110147472B (en) * 2017-07-14 2021-10-15 北京搜狗科技发展有限公司 Detection method and device for cheating sites and detection device for cheating sites
CN109255069A (en) * 2018-07-31 2019-01-22 阿里巴巴集团控股有限公司 A kind of discrete text content risks recognition methods and system
CN109361957B (en) * 2018-10-18 2021-02-12 广州酷狗计算机科技有限公司 Method and device for sending praise request
CN109831451A (en) * 2019-03-07 2019-05-31 北京华安普特网络科技有限公司 Preventing Trojan method based on firewall

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101814093A (en) * 2010-04-02 2010-08-25 南京邮电大学 Similarity-based semi-supervised learning spam page detection method
CN102184208A (en) * 2011-04-29 2011-09-14 武汉慧人信息科技有限公司 Junk web page detection method based on multi-dimensional data abnormal cluster mining
CN102750380A (en) * 2012-06-27 2012-10-24 山东师范大学 Page sorting method in combination with difference feature distribution and link feature

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8639773B2 (en) * 2010-06-17 2014-01-28 Microsoft Corporation Discrepancy detection for web crawling

Also Published As

Publication number Publication date
CN103064984A (en) 2013-04-24

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant