CN103064984B - Method and system for identifying spam pages - Google Patents
Method and system for identifying spam pages
- Publication number
- CN103064984B (application CN201310029963.XA)
- Authority
- CN
- China
- Prior art keywords
- query
- results
- web page
- page
- score
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The present invention provides a method and a system for identifying spam pages. The method includes: obtaining the query log of a search engine and preprocessing it to obtain a preprocessed query log; filtering, from the queries and result pages in the preprocessed query log, a query-result set in which the user click rate of each query and the occurrence count of each result page exceed their thresholds; manually screening the query-result set to extract a number of spam pages and form a spam page sample set; calculating, from the query-result set and the spam page sample set, a spam score for each result page and a cheating score for each query in the query-result set; and, when the spam score of a result page exceeds a threshold, classifying that result page as a spam page and adding it to a spam page set. By discovering and identifying spam pages through search engine query logs, the method according to embodiments of the invention reduces algorithmic complexity and offers good generalization and adaptability.
Description
Technical field
The present invention relates to the technical field of intelligent processing of network information, and in particular to a method and a system for identifying spam pages.
Background
The rapid growth in the volume of information on the Internet has made search engines an indispensable means of obtaining information in people's daily work and life. According to CNNIC statistics from December 2011, the number of search engine users among Chinese Internet users had reached 396 million, an adoption rate of nearly 80%, making search one of the most widely used Internet services. Search engines serve as an important entry point when users go online, so achieving a favorable ranking in search results has become the most effective way for an Internet resource to attract user attention quickly.
Under this model of information access, with the search engine as the main entry point, the high traffic and high revenue that come with a high search ranking tempt many Internet content providers to deceive search engine algorithms in order to obtain favorable rankings. Pages that profit from this kind of deception are spam pages. A spam page is defined as a page that exploits defects in a search engine's algorithms and uses deceptive techniques aimed at the search engine to obtain a ranking higher than the quality of its information deserves, in pursuit of direct or indirect benefit.
In 2003, Fetterly et al. concluded from a sampling analysis of English web pages that at least 8.1% of them were spam pages. Another group of researchers estimated in 2004 that about 10% to 15% of web content was spam. According to our own sampling analysis of about 800 million Chinese web pages, carried out with the assistance of the Sogou search engine, about 15% of Chinese web pages are spam pages.
Spam pages have significant adverse effects on users, on the Internet resource environment and on search engines. For users, spam pages occupy top positions in the result list and trick users into clicking on them, which makes useful information harder to find and reduces the efficiency of information access. Spam pages are also often combined with viruses, trojans and similar software, which seriously threatens users' information security. For the Internet resource environment, laws and regulations generally prevent search engines from offering paid advertising for illegal content such as pornography and gambling, so cheating to raise rankings becomes the main option for sites that provide such content. Spam pages are therefore full of illegal content of all kinds, and illegal pages boosted by cheating techniques tend to have a wide harmful reach and damage the Internet resource environment even more severely. For search engine systems, spam pages fill the index with useless pages and waste large amounts of storage space and processing time, which increases the cost of processing each query, lowers search efficiency and erodes users' trust in the search engine.
One class of conventional spam page identification methods targets content-based cheating. Work of this kind has analyzed the URL features and common-phrase features of spam pages, and has extracted page content features from 105 million web pages crawled by MSN Search, using features such as title length, average word length, proportion of visible content and content compression ratio to distinguish spam pages from normal pages. Later work added further content features, including the amount of anchor text and the number of popular words in a page, and used a learning-to-rank method to combine the features for spam page identification.
Another class is spam page identification based on link structure analysis. The TrustRank algorithm, proposed in 2004, opened a new way of using link structure information to identify spam pages, and can be applied to many kinds of spam, including both content cheating and link cheating. Although the method has no way of coping with noisy data in the link graph, a considerable number of researchers have built on improvements to TrustRank and proposed link analysis techniques for spam page identification, including Anti-TrustRank and Truncated PageRank.
The approaches above all achieve good results on relatively fixed web page test collections. Many of the results reported at Web Spam Challenge, a well-known international spam page evaluation, reach an identification accuracy above 80%, and the experimental accuracy reported in many related papers often exceeds 90%. For various reasons, however, these algorithms still face great challenges when applied to the real Internet environment and rarely achieve their full effect, which is why spam pages continue to have a major impact on search engine applications.
The shortcomings of the prior art are as follows:
(1) These algorithms can usually identify only one particular type of spam page and lack robustness. Cheating techniques keep emerging, so an algorithm may perform very well on one class of spam page yet be unable to identify other kinds. Once spam page authors adopt a new cheating technique, these algorithms tend to stop being effective.
(2) As cheating techniques have evolved, many algorithms have come to require large amounts of computation, storage or bandwidth, for example to build n-gram language models of page content, to crawl pages repeatedly, or to parse page scripts in depth. The efficiency of these algorithms therefore does not match the online service requirements of a search engine, and they cannot be applied in real search engine services.
Summary of the invention
The present invention aims to solve at least one of the technical defects described above.
To achieve this purpose, an embodiment of one aspect of the present invention provides a method for identifying spam pages, comprising the following steps: S1: obtaining the query log of a search engine and preprocessing the query log to obtain a preprocessed query log, wherein the preprocessed query log includes a plurality of queries and result pages; S2: filtering, from the queries and result pages of the preprocessed query log, a query-result set in which the user click rate of each query and the occurrence count of each result page exceed their thresholds; S3: manually screening the query-result set to extract a plurality of spam pages and generate a spam page sample set; S4: calculating, from the query-result set and the spam page sample set, the spam score of each result page and the cheating score of each query in the query-result set; and S5: if the spam score of a result page in the query-result set exceeds a threshold, classifying that result page as a spam page and adding it to a spam page set.
The method according to embodiments of the present invention discovers and identifies spam pages from search engine query log data. This reduces algorithmic complexity and keeps the structure and parameters simple, while the identification results are comprehensive and reliable and the method generalizes and adapts well.
In one example of the present invention, step S1 specifically includes: S11: obtaining the query log of the search engine and converting the query log to GBK encoding; S12: cleaning up the converted query log to obtain the preprocessed query log.
In one example of the present invention, step S2 specifically includes: S21: segmenting each query of the preprocessed query log into a plurality of keywords, and building a first query-result set from each keyword and the result pages that users clicked; S22: calculating the user click frequency on result pages for each query in the first query-result set, and filtering out the queries and result pages whose user click rate exceeds a threshold to generate a second query-result set; S23: calculating the number of times each result occurs in the second query-result set, and filtering out the queries and result pages whose occurrence count exceeds a threshold to generate the query-result set.
In one example of the present invention, step S4 specifically includes: S41: setting the initial cheating score of each query in the query-result set, and setting the initial spam score of each result page in the query-result set; S42: calculating the mean spam score of all result pages associated with each query in the query-result set and taking it as the cheating score of that query; and S43: calculating the mean cheating score of all queries associated with each result page in the query-result set; if the result page is not in the spam page sample set, that mean is taken as the spam score of the page, otherwise the spam score is left unchanged.
To achieve the above purpose, an embodiment of another aspect of the present invention provides a system for identifying spam pages, including: a preprocessing module, for obtaining the query log of a search engine and preprocessing the query log to obtain a preprocessed query log, wherein the preprocessed query log includes a plurality of queries and result pages; a screening module, for filtering, from the queries and result pages of the preprocessed query log, a query-result set in which the user click rate of each query and the occurrence count of each result page exceed their thresholds; an extraction module, for manually screening the query-result set to extract a plurality of spam pages and generate a spam page sample set; a calculation module, for calculating, from the query-result set and the spam page sample set, the spam score of each result page and the cheating score of each query in the query-result set; a judgment module, for judging whether the spam score of a result page in the query-result set exceeds a threshold, the page being a spam page if it does; and a processing module, for adding the result page to a spam page set.
The system according to embodiments of the present invention discovers and identifies spam pages from search engine query log data. This reduces algorithmic complexity and keeps the structure and parameters simple, while the identification results are comprehensive and reliable and the system generalizes and adapts well.
In one example of the present invention, the preprocessing module includes: an acquisition and conversion unit, for obtaining the query log of the search engine and converting the query log to GBK encoding; and a preprocessing unit, for cleaning up the converted query log to obtain the preprocessed query log.
In one example of the present invention, the screening module includes: a construction unit, for segmenting each query of the preprocessed query log into a plurality of keywords and building a first query-result set from each keyword and the result pages that users clicked; a first calculation unit, for calculating the user click frequency on result pages for each query in the first query-result set and filtering out the queries and result pages whose user click rate exceeds a threshold to generate a second query-result set; and a second calculation unit, for calculating the number of times each result occurs in the second query-result set and filtering out the queries and result pages whose occurrence count exceeds a threshold to generate the query-result set.
In one example of the present invention, the calculation module includes: a setting unit, for setting the initial cheating score of each query in the query-result set and the initial spam score of each result page in the query-result set; a third calculation unit, for calculating the mean spam score of all result pages associated with each query in the query-result set and taking it as the cheating score of that query; and a fourth calculation unit, for calculating the mean cheating score of all queries associated with each result page in the query-result set, wherein, if the result page is not in the spam page sample set, that mean is taken as the spam score of the page, and otherwise the spam score is left unchanged.
Additional aspects and advantages of the present invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a flow chart of the method for identifying spam pages according to one embodiment of the present invention;
Fig. 2 is a diagram of the structure of the preprocessed log according to one embodiment of the present invention;
Fig. 3 is a schematic diagram of the calculation of spam scores over the query-result set according to one embodiment of the present invention;
Fig. 4 is a block diagram of the system for identifying spam pages according to another embodiment of the present invention.
Detailed description of the invention
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals denote identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting the claims.
In the description of the present invention, it should be understood that the terms "first", "second", "third" and "fourth" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly specifying the number of the technical features referred to. A feature qualified by "first", "second", "third" or "fourth" may therefore explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality of" means two or more, unless specifically defined otherwise.
Fig. 1 is a flow chart of the method for identifying spam pages according to one embodiment of the present invention. As shown in Fig. 1, the method for identifying spam pages according to an embodiment of the present invention comprises the following steps:
Step S101: obtain the query log of a search engine and preprocess the query log to obtain a preprocessed query log, wherein the preprocessed query log includes a plurality of queries and result pages.
Specifically, the query log of the search engine is first obtained and converted to GBK encoding. The converted query log is then cleaned up to obtain the preprocessed query log, whose structure is shown in Fig. 2. Table 1 lists the content of the search engine query log after preprocessing.
Table 1
In one embodiment of the present invention, the log used consists of all queries submitted to the Sogou search engine over the nine days from March 1 to March 9, 2011. It contains 8,443,963 distinct queries and 12,470,865 distinct clicked pages, which belong to 1,055,001 distinct websites. The information contained in the log is shown in Table 2.
Table 2
The log information in Table 2 contains enough fields for automatic search engine evaluation, so this log can be used to evaluate the performance of the various Chinese search engines.
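The encoding conversion and cleanup of step S101 can be sketched as follows. This is only an illustration: the function name `convert_log_to_gbk`, the assumption that the raw log is UTF-8, and the choice to treat "cleanup" as dropping blank lines are all hypothetical, since the text does not specify the raw log format or the cleanup rules.

```python
def convert_log_to_gbk(raw: bytes, source_encoding: str = "utf-8") -> bytes:
    """Re-encode a raw query log as GBK and drop blank lines.

    Assumptions (not from the patent): the raw log is line-oriented text in
    `source_encoding`, and cleanup means discarding empty lines.
    """
    # Decode leniently so that one damaged byte does not abort the whole log.
    text = raw.decode(source_encoding, errors="replace")
    # Keep only lines that carry content.
    kept = [line.strip() for line in text.splitlines() if line.strip()]
    # Characters that GBK cannot represent are replaced rather than raising.
    return "\n".join(kept).encode("gbk", errors="replace")
```

The result is a GBK byte string with one log record per line, ready for the screening step.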
Step S102: filter, from the queries and result pages of the preprocessed query log, a query-result set in which the user click rate of each query and the occurrence count of each result page exceed their thresholds.
Specifically, each query in the preprocessed query log is segmented into a plurality of keywords, and a first query-result set is built from each keyword and the result pages that users clicked. The user click frequency on result pages is then calculated for each query in the first query-result set, and the queries and result pages whose user click rate exceeds a threshold are filtered out to generate a second query-result set. Finally, the number of times each result occurs in the second query-result set is calculated, and the queries and result pages whose occurrence count exceeds a threshold are filtered out to generate the query-result set.
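A minimal sketch of this two-stage filter is given below. Several points are assumptions rather than statements from the patent: the log is taken to be a list of `(query, clicked_url)` records, whitespace splitting stands in for a real Chinese word segmenter, "click rate" is read as the total number of clicks recorded for a keyword, and "occurrence count" as the total number of clicks on a URL within the second set. The function and parameter names are hypothetical.

```python
from collections import Counter

def build_query_result_set(click_records, min_query_clicks, min_url_occurrences,
                           tokenize=str.split):
    """Return {(keyword, url): click_count} after both threshold filters."""
    # S21: pair every keyword of a query with the page the user clicked.
    first = Counter()
    for query, url in click_records:
        for keyword in tokenize(query):
            first[(keyword, url)] += 1

    # S22: keep pairs whose keyword attracted more clicks than the threshold.
    clicks_per_keyword = Counter()
    for (keyword, _url), count in first.items():
        clicks_per_keyword[keyword] += count
    second = {pair: count for pair, count in first.items()
              if clicks_per_keyword[pair[0]] > min_query_clicks}

    # S23: keep pairs whose URL occurs more often than the threshold.
    occurrences_per_url = Counter()
    for (_keyword, url), count in second.items():
        occurrences_per_url[url] += count
    return {pair: count for pair, count in second.items()
            if occurrences_per_url[pair[1]] > min_url_occurrences}
```

The counts kept in the returned dictionary can later serve as association strengths between a keyword and a page when scores are propagated.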
Step S103: manually screen the query-result set to extract a plurality of spam pages and generate a spam page sample set.
Specifically, a batch of search results, for example 1000 query-result pairs, is drawn at random from the query-result set, and each result page in the batch is labeled as spam or not spam. Labeling stops once the number of labeled spam pages reaches a predetermined quantity, for example 200. If the batch does not yield that many spam pages, another 1000 pairs are drawn from the query-result set and labeled, and so on, until the number of spam pages reaches the predetermined quantity. The labeled spam pages form the spam page sample set.
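The batch-wise sampling loop described above might be organized as in the following sketch. The manual labeling itself cannot be coded, so it is represented by a caller-supplied `is_spam` callback; that callback, the function name and the `seed` parameter are hypothetical, and the defaults of 200 and 1000 are the example values from the text.

```python
import random

def build_spam_sample(urls, is_spam, target=200, batch_size=1000, seed=0):
    """Label random batches of result pages until `target` spam pages are found.

    `is_spam(url)` stands in for a human annotator's judgment.
    """
    rng = random.Random(seed)
    pool = list(urls)
    rng.shuffle(pool)  # random order, so each slice is a random batch
    spam_sample = set()
    for start in range(0, len(pool), batch_size):
        for url in pool[start:start + batch_size]:
            if is_spam(url):
                spam_sample.add(url)
                if len(spam_sample) >= target:
                    return spam_sample  # predetermined quantity reached
    # The pool was exhausted before the target was reached.
    return spam_sample
```

Shuffling once and slicing avoids drawing the same page twice, which repeated independent sampling would not guarantee.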
Step S104: calculate, from the query-result set and the spam page sample set, the spam score of each result page and the cheating score of each query in the query-result set.
Specifically, the initial cheating score of each query in the query-result set is set to 0, and the initial spam score of each result page is set to 1 if the page is in the spam page sample set and to 0 otherwise. Then, for each query, the mean spam score of all result pages associated with it is calculated and taken as the cheating score of that query. Finally, for each result page, the mean cheating score of all queries associated with it is calculated; if the page is not in the spam page sample set, this mean becomes its spam score, otherwise its spam score is left unchanged. In embodiments of the present invention, these two updates are repeated in turn, typically 20 to 30 times, and the spam score finally obtained is taken as the spam score of the result page.
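The alternating update of step S104 can be sketched in a few lines. The data layout is an assumption: the query-result set is passed as `(query, url, weight)` triples with positive weights, where a weight of 1 for every triple gives the plain mean. The names `propagate_scores`, `pairs` and `seed_spam` are hypothetical.

```python
from collections import defaultdict

def propagate_scores(pairs, seed_spam, iterations=20):
    """Propagate spam scores from labeled pages through shared queries.

    pairs: iterable of (query, url, weight) with weight > 0.
    seed_spam: set of URLs in the spam page sample set.
    Returns (spam_score_by_url, cheat_score_by_query).
    """
    by_query, by_url = defaultdict(list), defaultdict(list)
    for query, url, weight in pairs:
        by_query[query].append((url, weight))
        by_url[url].append((query, weight))

    # Initial values: labeled spam pages get 1, everything else 0.
    spam = {url: 1.0 if url in seed_spam else 0.0 for url in by_url}
    cheat = {query: 0.0 for query in by_query}

    for _ in range(iterations):
        # Query cheating score = weighted mean spam score of its result pages.
        for query, edges in by_query.items():
            total = sum(w for _, w in edges)
            cheat[query] = sum(spam[u] * w for u, w in edges) / total
        # Page spam score = weighted mean cheating score of its queries,
        # except that labeled sample pages keep their score of 1.
        for url, edges in by_url.items():
            if url in seed_spam:
                continue
            total = sum(w for _, w in edges)
            spam[url] = sum(cheat[q] * w for q, w in edges) / total
    return spam, cheat
```

Each iteration touches every query-result pair a constant number of times, so the cost grows linearly with the size of the query-result set and the number of iterations.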
Fig. 3 is a schematic diagram of the calculation of spam scores over the query-result set according to one embodiment of the present invention. As shown in Fig. 3, the query-result set contains the correspondence between queries and results, and the strength of the association between a query and a result is recorded as the frequency with which the pair occurs in the query-result set (denoted wii in Fig. 3). Starting from the small, manually labeled spam page sample set, the spam score of every page can be calculated iteratively. Suppose URL1 is a page in the spam page sample set (its spam score is 1) and URL2 is not (its initial spam score is 0). In the first iteration, the keyword cheating scores of Query1 and Query3 are each the mean of the spam scores of URL1 and URL2 (either an equal-weight mean or a mean weighted by association strength). The spam score of URL2 then becomes the mean of the keyword cheating scores of Query1 and Query3 (again either equal-weight or weighted by association strength). In this way the spam score spreads from the sample set to other pages, and by continuing in the same manner the spam scores of all pages can be calculated.
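The two update rules illustrated by Fig. 3 can be written compactly as below. The notation is an assumption introduced here for illustration: s(u_j) is the spam score of page u_j, c(q_i) is the cheating score of query q_i, w_{ij} is the association strength between them (the figure's own label appears only as "wii" in this text, so the indexing is a guess), N(.) is the set of neighbors in the query-result set, and S is the spam page sample set.

```latex
c(q_i) = \frac{\sum_{j \in N(q_i)} w_{ij}\, s(u_j)}{\sum_{j \in N(q_i)} w_{ij}},
\qquad
s(u_j) =
\begin{cases}
1, & u_j \in S,\\[4pt]
\dfrac{\sum_{i \in N(u_j)} w_{ij}\, c(q_i)}{\sum_{i \in N(u_j)} w_{ij}}, & u_j \notin S.
\end{cases}
```

Setting every w_{ij} to 1 gives the equal-weight mean. In the Fig. 3 example with equal weights, the first iteration yields c(Query1) = c(Query3) = (1 + 0)/2 = 0.5, after which s(URL2) = (0.5 + 0.5)/2 = 0.5.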
Step S105: classify as spam pages those result pages in the query-result set whose spam score exceeds a threshold, and add them to the spam page set.
In one embodiment of the present invention, the spam score threshold used as the criterion for spam pages can be chosen according to the circumstances, for example 0.8. The identified spam pages are added to the spam page set, where they serve as data for identifying spam pages.
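The final decision is a simple threshold test, sketched here under the assumption that scores are held in a dictionary keyed by URL. The function name `select_spam_pages` is hypothetical, and 0.8 is only the example threshold given in the text.

```python
def select_spam_pages(spam_scores, threshold=0.8):
    """Return the URLs whose spam score is strictly greater than the threshold."""
    return {url for url, score in spam_scores.items() if score > threshold}
```

The comparison is strict because the text speaks of scores that exceed the threshold; a page scoring exactly 0.8 is not selected in this sketch.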
The method according to embodiments of the present invention discovers and identifies spam pages from search engine query log data. This reduces algorithmic complexity and keeps the structure and parameters simple, while the identification results are comprehensive and reliable and the method generalizes and adapts well.
Fig. 4 is a block diagram of the system for identifying spam pages according to another embodiment of the present invention. As shown in Fig. 4, the system for identifying spam pages according to an embodiment of the present invention includes a preprocessing module 100, a screening module 200, an extraction module 300, a calculation module 400, a judgment module 500 and a processing module 600.
The preprocessing module 100 is used to obtain the query log of a search engine and preprocess the query log to obtain a preprocessed query log, wherein the preprocessed query log includes a plurality of queries and result pages.
In one embodiment of the present invention, the preprocessing module 100 includes an acquisition and conversion unit 110 and a preprocessing unit 120.
The acquisition and conversion unit 110 is used to obtain the query log of the search engine and convert the query log to GBK encoding.
The preprocessing unit 120 is used to clean up the converted query log to obtain the preprocessed query log.
In one embodiment of the present invention, the query log of the search engine is obtained and its encoding is uniformly converted to GBK. The converted query log is cleaned up and useless information is filtered out to obtain the preprocessed query log. Fig. 2 shows the structure of the preprocessed query log.
The screening module 200 is used to filter, from the queries and result pages of the preprocessed query log, a query-result set in which the user click rate of each query and the occurrence count of each result page exceed their thresholds.
In one embodiment of the present invention, the screening module 200 includes a construction unit 210, a first calculation unit 220 and a second calculation unit 230.
The construction unit 210 is used to segment each query of the preprocessed query log into a plurality of keywords and to build a first query-result set from each keyword and the result pages that users clicked.
The first calculation unit 220 is used to calculate the user click frequency on result pages for each query in the first query-result set and to filter out the queries and result pages whose user click rate exceeds a threshold to generate a second query-result set.
The second calculation unit 230 is used to calculate the number of times each result occurs in the second query-result set and to filter out the queries and result pages whose occurrence count exceeds a threshold to generate the query-result set.
The extraction module 300 is used to manually screen the query-result set to extract a plurality of spam pages and generate a spam page sample set.
In one embodiment of the present invention, a batch of search results, for example 1000 query-result pairs, is drawn at random from the query-result set, and each result page in the batch is labeled as spam or not spam. Labeling stops once the number of labeled spam pages reaches a predetermined quantity, for example 200. If the batch does not yield that many spam pages, another 1000 pairs are drawn from the query-result set and labeled, and so on, until the number of spam pages reaches the predetermined quantity. The labeled spam pages form the spam page sample set.
The calculation module 400 is used to calculate, from the query-result set and the spam page sample set, the spam score of each result page and the cheating score of each query in the query-result set.
In one embodiment of the present invention, the calculation module 400 includes a setting unit 410, a third calculation unit 420 and a fourth calculation unit 430.
The setting unit 410 is used to set the initial cheating score of each query in the query-result set and the initial spam score of each result page in the query-result set.
The third calculation unit 420 is used to calculate the mean spam score of all result pages associated with each query in the query-result set and take it as the cheating score of that query.
The fourth calculation unit 430 is used to calculate the mean cheating score of all queries associated with each result page in the query-result set; if the result page is not in the spam page sample set, that mean is taken as the spam score of the page, otherwise the spam score is left unchanged.
In embodiments of the present invention, the third calculation unit and the fourth calculation unit update the cheating scores and spam scores in turn, typically 20 to 30 times, and the spam score finally obtained is taken as the spam score of the result page.
The judgment module 500 is used to judge whether the spam score of a result page in the query-result set exceeds a threshold; if it does, the page is a spam page. In one embodiment of the present invention, the spam score threshold used as the criterion for spam pages can be chosen according to the circumstances, for example 0.8.
The processing module 600 is used to add the result page to the spam page set. The identified spam pages are added to the spam page set, where they serve as data for identifying spam pages.
The system according to embodiments of the present invention discovers and identifies spam pages from search engine query log data. This reduces algorithmic complexity and keeps the structure and parameters simple, while the identification results are comprehensive and reliable and the system generalizes and adapts well.
Although embodiments of the present invention have been shown and described above, it should be understood that these embodiments are exemplary and are not to be construed as limiting the invention. Those of ordinary skill in the art may change, modify, replace and vary the above embodiments within the scope of the invention without departing from its principles and purpose.
Claims (4)
1. A method for identifying spam pages, characterized by comprising the following steps:
S1: obtaining a query log of a search engine and preprocessing the query log to obtain a
preprocessed query log, wherein the preprocessed query log includes a plurality of queries
and result pages;
S2: screening out, from the plurality of queries and result pages of the preprocessed query
log, a query-result set in which the user click rate of the queries and the number of
occurrences of the result pages are greater than a threshold;
S3: manually screening and extracting a plurality of spam pages from the query-result set
to generate a spam page sample set;
S4: calculating, according to the query-result set and the spam page sample set, the spam
score of each result page and the cheating score of each query in the query-result set; and
S5: if the spam score of a result page in the query-result set is greater than a threshold,
determining that the result page is a spam page, and adding the result page to the spam
page set,
wherein step S2 specifically comprises:
S21: segmenting each query of the preprocessed query log into a plurality of keywords, and
building a first query-result set from each keyword of the plurality of keywords and the
result pages clicked by users;
S22: calculating the user result-page click frequency of each query in the first
query-result set, and screening out therefrom the queries and result pages whose user
click rate is greater than a threshold to generate a second query-result set;
S23: calculating the number of times each result in the second query-result set occurs in
the second query-result set, and screening out therefrom the queries and result pages
whose number of occurrences is greater than a threshold to generate the query-result set,
wherein step S4 specifically comprises:
S41: setting an initial cheating score for each query in the query-result set, and setting
an initial spam score for the result pages in the query-result set;
S42: calculating the average of the spam scores of all result pages associated with each
query in the query-result set as the cheating score of the corresponding query; and
S43: calculating the average of the cheating scores of all queries associated with each
result page in the query-result set; if the result page is not among the spam pages,
taking the average of the cheating scores as the spam score of the corresponding page,
and otherwise leaving the spam score unchanged.
2. The method for identifying spam pages according to claim 1, characterized in that step
S1 specifically comprises:
S11: obtaining the query log of the search engine, and converting the query log to GBK format;
S12: collating the converted query log to obtain the preprocessed query log.
3. A system for identifying spam pages, characterized by comprising:
a preprocessing module, configured to obtain a query log of a search engine and preprocess
the query log to obtain a preprocessed query log, wherein the preprocessed query log
includes a plurality of queries and result pages;
a screening module, configured to screen out, from the plurality of queries and result
pages of the preprocessed query log, a query-result set in which the user click rate of
the queries and the number of occurrences of the result pages are greater than a threshold;
an extraction module, configured to manually screen and extract a plurality of spam pages
from the query-result set to generate a spam page sample set;
a computing module, configured to calculate, according to the query-result set and the spam
page sample set, the spam score of each result page and the cheating score of each query
in the query-result set;
a judging module, configured to judge whether the spam score of a result page in the
query-result set is greater than a threshold, the result page being a spam page if the
score is greater than the threshold; and
a processing module, configured to add the result page to the spam page set,
wherein the screening module comprises:
a building unit, configured to segment each query of the preprocessed query log into a
plurality of keywords, and to build a first query-result set from each keyword of the
plurality of keywords and the result pages clicked by users;
a first computing unit, configured to calculate the user result-page click frequency of
each query in the first query-result set, and to screen out therefrom the queries and
result pages whose user click rate is greater than a threshold to generate a second
query-result set;
a second computing unit, configured to calculate the number of times each result in the
second query-result set occurs in the second query-result set, and to screen out therefrom
the queries and result pages whose number of occurrences is greater than a threshold to
generate the query-result set,
wherein the computing module comprises:
a setting unit, configured to set an initial cheating score for each query in the
query-result set, and to set an initial spam score for the result pages in the
query-result set;
a third computing unit, configured to calculate the average of the spam scores of all
result pages associated with each query in the query-result set as the cheating score of
the corresponding query; and
a fourth computing unit, configured to calculate the average of the cheating scores of all
queries associated with each result page in the query-result set, and, if the result page
is not among the spam pages, to take the average of the cheating scores as the spam score
of the corresponding page, and otherwise to leave the spam score unchanged.
4. The system for identifying spam pages according to claim 3, characterized in that the
preprocessing module comprises:
an obtaining and converting unit, configured to obtain the query log of the search engine
and convert the query log to GBK format;
a preprocessing unit, configured to collate the converted query log to obtain the
preprocessed query log.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310029963.XA CN103064984B (en) | 2013-01-25 | 2013-01-25 | The recognition methods of spam page and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103064984A | 2013-04-24 |
CN103064984B | 2016-08-10 |
Family
ID=48107614
Family Applications (1)
Application Number | Status | Publication Number | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201310029963.XA | Active | CN103064984B (en) | 2013-01-25 | 2013-01-25 | The recognition methods of spam page and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103064984B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104598460B (en) * | 2013-10-30 | 2018-11-02 | 腾讯科技(深圳)有限公司 | The recognition methods of rubbish Anchor Text and device |
CN103595732B (en) * | 2013-11-29 | 2017-09-15 | 北京奇虎科技有限公司 | A kind of method and device of network attack evidence obtaining |
CN104933055B (en) * | 2014-03-18 | 2020-01-31 | 腾讯科技(深圳)有限公司 | Webpage identification method and webpage identification device |
CN106844371B (en) * | 2015-12-03 | 2020-09-08 | 阿里巴巴集团控股有限公司 | Search optimization method and device |
CN106844685B (en) * | 2017-01-26 | 2020-07-28 | 百度在线网络技术(北京)有限公司 | Method, device and server for identifying website |
CN110147472B (en) * | 2017-07-14 | 2021-10-15 | 北京搜狗科技发展有限公司 | Detection method and device for cheating sites and detection device for cheating sites |
CN109255069A (en) * | 2018-07-31 | 2019-01-22 | 阿里巴巴集团控股有限公司 | A kind of discrete text content risks recognition methods and system |
CN109361957B (en) * | 2018-10-18 | 2021-02-12 | 广州酷狗计算机科技有限公司 | Method and device for sending praise request |
CN109831451A (en) * | 2019-03-07 | 2019-05-31 | 北京华安普特网络科技有限公司 | Preventing Trojan method based on firewall |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101814093A (en) * | 2010-04-02 | 2010-08-25 | 南京邮电大学 | Similarity-based semi-supervised learning spam page detection method |
CN102184208A (en) * | 2011-04-29 | 2011-09-14 | 武汉慧人信息科技有限公司 | Junk web page detection method based on multi-dimensional data abnormal cluster mining |
CN102750380A (en) * | 2012-06-27 | 2012-10-24 | 山东师范大学 | Page sorting method in combination with difference feature distribution and link feature |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8639773B2 (en) * | 2010-06-17 | 2014-01-28 | Microsoft Corporation | Discrepancy detection for web crawling |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |