CN106294535B

CN106294535B - Website identification method and device

Info

Publication number: CN106294535B
Application number: CN201610571258.6A
Authority: CN
Inventors: 邹红建; 方高林
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2016-07-19
Filing date: 2016-07-19
Publication date: 2019-06-25
Anticipated expiration: 2036-07-19
Also published as: CN106294535A

Abstract

Embodiments of the invention disclose a website identification method and device. The method includes: within a set time period, obtaining at least two historical updated pages associated with a website to be verified; parsing the content of each historical updated page to obtain at least one content domain corresponding to each historical updated page; calculating the information entropy of each content domain according to the content changes within the same content domain across the historical updated pages; and performing anomaly identification on the website to be verified according to the information entropy calculation results. The information entropy features used by the technical solution discriminate well, are simple to calculate, and are timely. The solution addresses the technical problems of existing cheating-website identification techniques, namely low discrimination, poor real-time performance, and the need for additional manual labeling or data preparation; it optimizes existing website identification techniques and improves the accuracy of identifying abnormal websites.

Description

Website identification method and device

Technical field

Embodiments of the present invention relate to computer processing technology, and in particular to a website identification method and device.

Background art

Information retrieval refers to the process of finding required documents, or the information contained in required documents, within a collection of information resources. A search engine is an information retrieval tool for finding information on the internet. The emergence of search engines made it convenient for people to obtain information from massive resources. With search engines came the problem of web page cheating. For economic or other benefit, cheating websites mislead search engines by various methods in order to raise the rank of their pages in search engine results. Cheating websites are generally of low quality and usually contain advertising, in particular pornographic and gambling advertising, which seriously harms the user experience; identifying cheating websites is therefore an important problem in information retrieval. Improving cheating-website identification technology is of great significance for improving search engine effectiveness.

At present, the cheating methods used by cheating websites change frequently, but they can generally be grouped into two major classes: content cheating and link cheating. Content cheating generally raises a page's rank in search engine results by stuffing the page with popular queries. Link cheating mainly targets page scoring algorithms such as PageRank (that is, graph algorithms modeled on PageRank), constructing link relationships to raise a website's weight; link cheating also includes cheating through page redirection. Cheating-website identification has long been a research hotspot in the industry. Machine learning methods including naive Bayes, logistic regression, SVM (Support Vector Machine), ensemble learning, and deep learning have all been applied, using features such as content features and link features. Some approaches also use external information such as user click behavior for identification.

The main shortcomings of existing cheating-website identification techniques are as follows. Cheating pages whose page structure features are not distinctive and whose text content is not stuffed with cheating words are difficult to identify in time. Graph model algorithms that rely on link relationship features are complex and struggle to meet the demand for real-time identification. Distinguishing newly emerging ordinary websites and relatively niche websites from newly emerging cheating websites is another difficulty. In addition, a major challenge for the cheating-website identification task is that cheating websites update rapidly, so existing cheating identification schemes or models gradually lose effectiveness over time. Reinforcement learning and active learning can partially solve this problem, but they require additional manual labeling or data preparation.

Summary of the invention

In view of this, embodiments of the invention provide a website identification method and device, so as to optimize existing website identification techniques and improve the accuracy of identifying abnormal websites.

In a first aspect, an embodiment of the invention provides a website identification method, comprising:

within a set time period, obtaining at least two historical updated pages associated with a website to be verified;

parsing the content of each historical updated page to obtain at least one content domain corresponding to each historical updated page;

calculating the information entropy of each content domain according to the content changes within the same content domain across the historical updated pages; and

performing anomaly identification on the website to be verified according to the information entropy calculation results.

In a second aspect, an embodiment of the invention further provides a website identification device, comprising:

a historical updated page obtaining module, configured to obtain, within a set time period, at least two historical updated pages associated with a website to be verified;

a content domain obtaining module, configured to parse the content of each historical updated page to obtain at least one content domain corresponding to each historical updated page;

a content domain information entropy calculation module, configured to calculate the information entropy of each content domain according to the content changes within the same content domain across the historical updated pages; and

an anomaly identification module, configured to perform anomaly identification on the website to be verified according to the information entropy calculation results.

In embodiments of the invention, at least two historical updated pages associated with a website to be verified are obtained within a set time period; the content of each historical updated page is parsed to obtain at least one content domain corresponding to each page; the information entropy of each content domain is calculated according to the content changes within the same content domain across the pages; and anomaly identification is performed on the website to be verified according to the information entropy calculation results. Because information entropy features discriminate well, are simple to calculate, and are timely, this addresses the technical problems of existing cheating-website identification techniques (low discrimination, poor real-time performance, and the need for additional manual labeling or data preparation), optimizes existing website identification techniques, and improves the accuracy of identifying abnormal websites.

Brief description of the drawings

Fig. 1 is a flowchart of a website identification method provided by Embodiment 1 of the present invention;

Fig. 2 is a flowchart of a website identification method provided by Embodiment 2 of the present invention;

Fig. 3 is a flowchart of a website identification method provided by Embodiment 3 of the present invention;

Fig. 4 is a structural diagram of a website identification device provided by Embodiment 4 of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, specific embodiments of the invention are described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it.

It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the full content. Before the exemplary embodiments are discussed in greater detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently, or simultaneously. In addition, the order of the operations can be rearranged. A process may be terminated when its operations are completed, but it may also have additional steps not included in the drawings. A process may correspond to a method, function, procedure, subroutine, subprogram, and so on.

For ease of understanding, the inventive concept of the present invention is first briefly introduced:

Through research, the inventors found the following. In terms of purpose, a cheating website seeks a higher ranking so that the advertising content embedded in the website receives more visits. The advertising categories of cheating websites are generally concentrated; most are gambling, pornography, cosmetic medicine, firearms and equipment, and the like. The cheating behavior of cheating websites follows traceable patterns. To get search engines to index them and to obtain high ranking positions, cheating websites frequently update page content and add currently popular high-frequency queries to their pages. For cost reasons, cheating websites generally copy the same page content. To cope with search engines' anti-cheating strategies, the content, style, and URLs of cheating websites also need to be updated frequently.

From the above analysis: cheating websites update frequently and contain advertising information, yet within a given period this advertising information is not updated frequently. That is, cheating websites carry unreasonable redundancy at certain important positions, whereas normal websites, especially high-quality ones, have no need for such redundancy because it provides no additional valuable information.

Shannon, the founder of information theory, introduced the concept of entropy into information theory as a measure of the amount of information. The amount of information is related to the degree of uncertainty: the higher the entropy, the higher the uncertainty, and the greater the amount of additional information needed to describe the situation clearly.

That is, from the perspective of information theory, if a normal website updates frequently, the amount of information it provides is large and its entropy is correspondingly large; if it updates infrequently, the amount of information it provides is small and its entropy is correspondingly small. A cheating website updates often, so its entropy is expected to be large; however, certain content domains or certain objects contain advertising information that is updated slowly, which lowers their entropy. In other words, the actual entropy of certain content domains differs from the expected entropy. Calculating the entropy of the different content domains of a website, and the degree to which they differ, can therefore help to identify cheating websites effectively.

Based on the above analysis, the inventors creatively propose introducing the concept of information entropy into the identification of abnormal websites: anomaly identification is performed on a website by calculating the information entropy of one or more content domains in that website.

Embodiment one

Fig. 1 is a flowchart of a website identification method provided by Embodiment 1 of the present invention. The method of this embodiment may be executed by a website identification device, which may be implemented in hardware and/or software and can generally be integrated into a server used to implement the abnormal-website identification function. The method of this embodiment specifically includes:

110. Within a set time period, obtain at least two historical updated pages associated with a website to be verified.

In this embodiment, the website to be verified refers to a website on which anomaly identification needs to be performed. All websites indexed by a search engine could be treated as websites to be verified. However, abnormal websites (typically cheating websites) frequently update page content in order to rank higher in search engine results, so websites with newly generated or updated pages can be selected as the websites to be verified, which also helps to reduce the amount of computation.

As described above, the core of the invention is to perform anomaly identification on a website to be verified by analyzing the information entropy of each content domain in that website, and information entropy mainly measures the degree of uncertainty of the content appearing in a content domain. It is therefore necessary to obtain at least two historical updated pages associated with the website to be verified within a set time period (for example, 1 hour, 1 day, or 1 week) and to determine the information entropy of each content domain in the website by analyzing the content updated in those pages.

The at least two historical updated pages associated with the website to be verified may include: at least two historical updated pages corresponding to the website domain name of the website to be verified; and/or at least two historical updated pages corresponding to the same web page address in the website to be verified.

In a specific example, the website domain name of a website to be verified is www.A.com, and all historical updated pages corresponding to that domain name within the set time period can be obtained as the historical updated pages associated with the website to be verified. Further, a website may contain several different types of subpages at once (for example, a news website may contain subpages such as "current affairs", "entertainment", and "sports"). For a finer-grained analysis, all historical updated pages corresponding to the same web page address in the website to be verified (for example, www.A.com/B) can be obtained as the historical updated pages associated with the website to be verified.

120. Parse the content of each historical updated page to obtain at least one content domain corresponding to each historical updated page.

In general, a page contains different types of data content. In this embodiment, these different types of data content are defined as domains, for example: text title, text body, picture title, picture, and picture description text. By parsing the page, that is, by analyzing the page's HTML (HyperText Markup Language) file, a page can be divided into different domains, and the text, pictures, and other content contained in those domains can be extracted.

Considering the computational complexity of the subsequent entropy calculation, in this embodiment the content domains selected for calculating information entropy may include at least one of the following: a text title domain, a picture domain, a picture title domain, and a picture description text domain.

The text title domain refers to the page position of one or more text titles; the picture domain refers to the page position of one or more pictures; the picture title domain refers to the page position of one or more picture titles corresponding to pictures; and the picture description text domain refers to the page position of one or more picture description texts corresponding to pictures.
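The parsing step can be illustrated with a minimal sketch using Python's standard `html.parser` module. The mapping from HTML tags to content domains is an assumption made here for illustration only (headings as text titles, `img` `src` as the picture, `img` `alt` as the picture title, `figcaption` as the picture description text); the patent does not specify how domains are located in the markup, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ContentDomainParser(HTMLParser):
    """Collects page content into four content domains (tag mapping is assumed)."""

    TITLE_TAGS = {"h1", "h2", "h3"}  # assumption: headings carry text titles

    def __init__(self):
        super().__init__()
        self.domains = {
            "text_title": [],
            "picture": [],
            "picture_title": [],
            "picture_description": [],
        }
        self._current = None  # domain currently receiving text, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            # assumption: src identifies the picture, alt acts as its title
            if attrs.get("src"):
                self.domains["picture"].append(attrs["src"])
            if attrs.get("alt"):
                self.domains["picture_title"].append(attrs["alt"])
        elif tag in self.TITLE_TAGS:
            self._current, self._buffer = "text_title", []
        elif tag == "figcaption":
            self._current, self._buffer = "picture_description", []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        closes_domain = tag in self.TITLE_TAGS or tag == "figcaption"
        if self._current is not None and closes_domain:
            text = "".join(self._buffer).strip()
            if text:
                self.domains[self._current].append(text)
            self._current = None


def parse_content_domains(html):
    """Return a dict mapping each content domain name to its extracted items."""
    parser = ContentDomainParser()
    parser.feed(html)
    parser.close()
    return parser.domains
```

Calling `parse_content_domains` on each historical updated page yields, per page, the items found in each domain, which is the input the entropy calculation needs.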

130. Calculate the information entropy of each content domain according to the content changes within the same content domain across the historical updated pages.

From the concept of information entropy it follows that the more frequently the content in a content domain changes, the greater the uncertainty of the content in that domain and the greater its information entropy; conversely, the more fixed the content in a content domain, the smaller the uncertainty of the content in that domain and the smaller its information entropy.

The information entropy is calculated as:

$H(x) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)$

where x takes n possible values $x_1, \ldots, x_i, \ldots, x_n$, with corresponding probabilities $P(x_1), \ldots, P(x_i), \ldots, P(x_n)$.

Typically, the information entropy of each content domain can be calculated from the number of occurrences of the different contents of that content domain across the historical updated pages.
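A minimal sketch of this frequency-based calculation follows, assuming each observed content item (a title string, a picture URL, and so on) is hashable. It uses the standard Shannon entropy with base-2 logarithms; the function name `information_entropy` is hypothetical.

```python
import math
from collections import Counter


def information_entropy(observations):
    """Shannon entropy (in bits) of the items observed in one content domain.

    `observations` is the flat list of items seen in that domain across all
    historical updated pages; probabilities are estimated from frequencies.
    """
    counts = Counter(observations)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # assumption: an empty domain is treated as zero entropy
    # sum of p * log2(1/p), which equals -sum of p * log2(p)
    return sum((c / total) * math.log2(total / c) for c in counts.values())
```

A domain that shows the same advertisement in every page gives an entropy of 0, while a domain whose items never repeat gives the maximum value for that number of observations.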

140. Perform anomaly identification on the website to be verified according to the information entropy calculation results.

In a preferred implementation of this embodiment, the information entropy calculation results for the content domains of the website to be verified can be compared with the information entropies of the content domains of a reliable website, and anomaly identification can then be performed on the website to be verified;

In another preferred implementation of this embodiment, the information entropies of different content domains within the website to be verified can be compared with one another, and anomaly identification can then be performed on the website to be verified;

In yet another preferred implementation of this embodiment, the information entropy calculation results can be used as at least one information entropy feature value, which is combined with other abnormal-website identification feature values to perform anomaly identification on the website to be verified.

In general, the prior art mainly uses a classifier to perform anomaly identification on a website to be verified: one or more abnormal-website identification feature values (typically content features, link features, and the like) are fed into the classifier to identify abnormal websites. In this embodiment, besides identifying abnormal websites directly from information entropy, the information entropy calculation results for the content domains of the website to be verified can also be used, on top of existing abnormal-website identification techniques, as one or more information entropy feature values. These are input into the classifier together with the other abnormal-website identification feature values, so that anomaly identification is performed on the website to be verified in combination with existing techniques, further improving the accuracy of identifying abnormal websites.
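As a rough sketch of how entropy values might be combined with other features before classification, the snippet below builds one flat feature vector. The domain names, their ordering, and the choice to use 0.0 for a domain absent from the page are all assumptions made for illustration; the classifier itself is outside the scope of the sketch.

```python
# Hypothetical fixed ordering of content domains, so that every website
# produces entropy features in the same vector positions.
DOMAIN_ORDER = ("text_title", "picture", "picture_title", "picture_description")


def build_feature_vector(domain_entropies, other_features):
    """Concatenate per-domain entropy features with other feature values.

    `domain_entropies` maps a domain name to its information entropy;
    `other_features` is any sequence of existing feature values
    (for example, content or link features).
    """
    # assumption: a domain that was not observed contributes 0.0
    entropy_features = [domain_entropies.get(d, 0.0) for d in DOMAIN_ORDER]
    return entropy_features + list(other_features)
```

The resulting vector can be passed to whichever classifier an existing identification pipeline already uses.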

In this embodiment of the invention, at least two historical updated pages associated with a website to be verified are obtained within a set time period; the content of each historical updated page is parsed to obtain at least one content domain corresponding to each page; the information entropy of each content domain is calculated according to the content changes within the same content domain across the pages; and anomaly identification is performed on the website to be verified according to the information entropy calculation results. Because information entropy features discriminate well, are simple to calculate, and are timely, this addresses the technical problems of existing cheating-website identification techniques (low discrimination, poor real-time performance, and the need for additional manual labeling or data preparation), optimizes existing website identification techniques, and improves the accuracy of identifying abnormal websites.

Embodiment two

Fig. 2 is a flowchart of a website identification method provided by Embodiment 2 of the present invention. This embodiment is optimized on the basis of the above embodiment. In this embodiment, obtaining, within a set time period, at least two historical updated pages associated with a website to be verified is specifically optimized as: within the set time period, crawling newly generated and/or updated pages on the network with a web crawler; clustering the crawled pages by website domain name and taking the website corresponding to a cluster as the website to be verified; and obtaining, from the pages included in the cluster, at least two historical updated pages associated with the website to be verified;

Meanwhile, calculating the information entropy of each content domain according to the content changes within the same content domain across the historical updated pages is specifically optimized as: extracting at least one comparison object from the same target content domain of each historical updated page; calculating the occurrence probability of each comparison object according to the number of times it occurs in the target content domain of the historical updated pages; and calculating the information entropy corresponding to the target content domain according to the occurrence probabilities of the comparison objects.

Correspondingly, the method of this embodiment specifically includes:

210. Within a set time period, crawl newly generated and/or updated pages on the network with a web crawler.

In this embodiment, it is considered that abnormal websites, especially cheating websites, generally update relatively frequently. Therefore, the newly generated and updated pages crawled from the network by a web crawler can be obtained first and then merged and clustered by website, which in turn determines the corresponding websites to be verified.

220. Cluster the crawled pages by website domain name, and take the website corresponding to a cluster as the website to be verified.

230. From the pages included in the cluster, obtain at least two historical updated pages associated with the website to be verified.

If the at least two historical updated pages associated with the website to be verified are at least two historical updated pages corresponding to the website domain name of the website to be verified, then obtaining, from the pages included in the cluster, at least two historical updated pages associated with the website to be verified may specifically include:

taking all pages included in the cluster directly as the historical updated pages associated with the website to be verified;

If the at least two historical updated pages associated with the website to be verified are at least two historical updated pages corresponding to the same web page address in the website to be verified, then obtaining, from the pages included in the cluster, at least two historical updated pages associated with the website to be verified may specifically include:

grouping the pages included in the cluster by URL (Uniform Resource Locator) address, where the pages in the same group correspond to the same URL address; and taking the pages included in the same group as the historical updated pages associated with the website to be verified.
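The clustering and grouping described in steps 220 and 230 can be sketched as follows. The sketch assumes the crawler output is a list of `(url, html)` tuples and treats the host part of the URL as the website domain name; both are illustrative assumptions, and the function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cluster_by_domain(crawled_pages):
    """Cluster crawled (url, html) snapshots by website domain name."""
    clusters = defaultdict(list)
    for url, html in crawled_pages:
        # assumption: the lower-cased host identifies the website
        clusters[urlsplit(url).netloc.lower()].append((url, html))
    return dict(clusters)


def group_by_url(cluster):
    """Group one cluster's snapshots by identical URL.

    Only URLs with at least two snapshots are kept, since the method needs
    at least two historical updated pages to compare.
    """
    groups = defaultdict(list)
    for url, html in cluster:
        groups[url].append(html)
    return {url: snaps for url, snaps in groups.items() if len(snaps) >= 2}
```

Using a whole cluster corresponds to the domain-level option, and using one group from `group_by_url` corresponds to the finer-grained same-address option.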

240. Parse the content of each historical updated page to obtain at least one content domain corresponding to each historical updated page.

250. Extract at least one comparison object from the same target content domain of each historical updated page.

In this embodiment, if the content of the target content domain includes text, the comparison object may include the original text, a semantic signature, or a semantic category; if the content of the target content domain includes a picture, the comparison object may include the original picture or a picture category.

The original text refers to the text content that appears directly in a content domain. For example, if the text content in the text title domain is "On 2016.6.17, XX company was listed in the United States", then that text content is the original text;

The semantic signature is a refinement of the original text: semantic recognition and processing are applied to the original text, the core semantic content is retained, and it is expressed as a combination of several core words. This combination of core words is called the semantic signature. Continuing the previous example, for the original text "On 2016.6.17, XX company was listed in the United States", the corresponding semantic signature is "XX company, United States, listed";

The semantic category refers to the semantic category of the original text content. Continuing the previous example, for the original text "On 2016.6.17, XX company was listed in the United States", the corresponding semantic category is "finance".

It should be understood that the original text, the semantic signature, and the semantic category represent information types of different granularity, from fine to coarse. Correspondingly, calculating the information entropy of these three information types yields information measures of different granularity. In practice, those skilled in the art can select an information type of suitable granularity as the comparison object according to the required accuracy of abnormal-website identification.

Similarly, the original picture refers to the picture content that appears directly in a content domain, and the picture category refers to the category of the picture under a given classification system.

Of course, those skilled in the art will understand that other forms of comparison object can also be obtained from a content domain. In fact, any clearly definable and identifiable page section or type of page information can serve as the comparison object; this embodiment places no limitation on this.

260. Calculate the occurrence probability of each comparison object according to the number of times it occurs in the target content domain of the historical updated pages.

In a specific example, within one day a website to be verified has three historical updated pages: historical updated page 1, historical updated page 2, and historical updated page 3. The selected target content domain is the text title domain, and the selected comparison object is the original text.

The original texts appearing in the text title domain of historical updated page 1 are text title 1, text title 2, and text title 3; those appearing in the text title domain of historical updated page 2 are text title 1, text title 3, and text title 4; and those appearing in the text title domain of historical updated page 3 are text title 3 and text title 5.

Correspondingly, 8 text titles appear in total across the three historical updated pages. Text title 1 occurs 2 times in total, so its occurrence probability is 2/8; text title 2 occurs 1 time, so its occurrence probability is 1/8; text title 3 occurs 3 times, so its occurrence probability is 3/8; text title 4 occurs 1 time, so its occurrence probability is 1/8; and text title 5 occurs 1 time, so its occurrence probability is 1/8.

270. Calculate the information entropy corresponding to the target content domain according to the occurrence probabilities of the comparison objects.

According to the information entropy formula, the information entropy H corresponding to the target content domain is:

$H = \frac{1}{4}\log_2 4 + \frac{1}{8}\log_2 8 + \frac{3}{8}\log_2 \frac{8}{3} + \frac{1}{8}\log_2 8 + \frac{1}{8}\log_2 8 \approx 2.156$
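The three-page example can be checked numerically with a short sketch. The title strings are placeholders standing in for text titles 1 to 5, and the entropy is evaluated with the standard Shannon formula in base 2.

```python
import math
from collections import Counter

# Text titles observed in the text title domain of the three pages
# (placeholder strings standing in for text titles 1 to 5).
pages = [
    ["title 1", "title 2", "title 3"],  # historical updated page 1
    ["title 1", "title 3", "title 4"],  # historical updated page 2
    ["title 3", "title 5"],             # historical updated page 3
]

counts = Counter(title for page in pages for title in page)
total = sum(counts.values())  # 8 titles in total
probabilities = {title: c / total for title, c in counts.items()}

# Shannon entropy in bits: sum of p * log2(1/p) over the comparison objects.
entropy = sum(p * math.log2(1 / p) for p in probabilities.values())
print(round(entropy, 3))  # prints 2.156
```

The probabilities come out as 2/8, 1/8, 3/8, 1/8, and 1/8, matching the counts in the example.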

280. Perform anomaly identification on the website to be verified according to the information entropy calculation results.

By analyzing the characteristics of various cheating websites, the inventors found that if, across multiple historical updated pages corresponding to the same website, the main pictures of the pages repeat heavily (the information entropy of the pictures is small) while the picture description texts or text titles rarely repeat (their information entropy is large), the website is more likely to be a cheating website. In addition, if the information entropy of the picture category differs markedly from that of the picture title, the website is also more likely to be a cheating website.

Accordingly, in a preferred implementation of this embodiment, performing anomaly identification on the website to be verified according to the information entropy calculation results may include:

if the sum of the information entropies of the content domains corresponding to the website to be verified is less than a set first threshold, determining that the website to be verified is an abnormal website; or

if the information entropy of at least one target content domain corresponding to the website to be verified is less than a set second threshold, determining that the website to be verified is an abnormal website; or

if the ratio between the information entropies of at least two target content domains corresponding to the website to be verified is less than a set third threshold, determining that the website to be verified is an abnormal website.

The first, second, and third thresholds can be preset according to the actual situation; this embodiment places no limitation on this.
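The three alternative threshold rules can be sketched as separate predicates. The function names, the domain names, and the decision to return `False` when the ratio's denominator is zero are assumptions made for illustration; threshold values would be chosen by the implementer.

```python
def rule_total_entropy(entropies, first_threshold):
    """Rule 1: the sum of all content-domain entropies is below the threshold."""
    return sum(entropies.values()) < first_threshold


def rule_any_domain(entropies, target_domains, second_threshold):
    """Rule 2: at least one target content domain has entropy below the threshold."""
    return any(entropies[d] < second_threshold for d in target_domains)


def rule_entropy_ratio(entropies, numerator, denominator, third_threshold):
    """Rule 3: the entropy ratio of two target content domains is below the threshold."""
    if entropies[denominator] == 0:
        return False  # assumption: an undefined ratio is not flagged
    return entropies[numerator] / entropies[denominator] < third_threshold
```

For instance, with entropies of 0.2 for the picture domain and 3.0 for the picture description text domain, `rule_entropy_ratio` with a hypothetical third threshold of 0.1 flags the website, which matches the observation that heavily repeated pictures paired with rarely repeated descriptions suggest cheating.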

The technical solution of the present embodiment newly generates or has the page of update by screening in some period, will come from identical The page aggregation of website together, and chooses the mode that website to be verified carries out anomalous identification according to polymerization result, compared to general Whole websites that search engine is included carry out the mode of anomalous identification, under the premise of not dramatically increasing omission factor, Ke Yi great It is big to reduce calculation amount；In addition, by carrying out anomalous identification to website according to the comentropy difference of each content domain in a website Mode, do not need to introduce any reference site, only according to the comentropy difference feature in different content domain in website to be verified, Technical effect that is simple, accurately identifying abnormal website can be realized.

On the basis of the above embodiments, before calculating the occurrence probability of the comparison object according to the number of occurrences of the comparison object in each target content domain, the method may further include:

If it is determined that the comparison object is a time-sensitive simple repeated text, obtaining, in each historical update page, the body content associated with the comparison object; and if, in different historical update pages, the body content corresponding to the same target comparison object is not identical, marking the target comparison object as different comparison objects.

The reason for this arrangement is that, when information entropy is calculated, texts that are time-sensitive but identical in form require special handling. For example, news headlines such as "One-Week News Flash" and "Domestic Briefs" correspond to different body content at different times, so the body content must be taken into account when calculating information entropy. That is, suppose the comparison object "One-Week News Flash" appears in both historical update page 1 and historical update page 2. If only the number of occurrences of "One-Week News Flash" is counted, the occurrence probability of the comparison object is 1. However, because "One-Week News Flash" is a time-sensitive text, the body content corresponding to it in historical update page 1 and historical update page 2 must also be compared. If the two differ, the "One-Week News Flash" in historical update page 1 and the "One-Week News Flash" in historical update page 2 can be identified as different comparison objects, and the occurrence probability of the comparison object can then be determined to be 1/2.

With the above arrangement, the calculation accuracy of the information entropy can be improved, which in turn improves the accuracy of identifying abnormal websites.
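The "One-Week News Flash" example can be reproduced with a short sketch. It assumes a base-2 Shannon entropy and uses a SHA-256 digest of the body content to decide whether two bodies are identical; both choices, as well as the function names, are illustrative assumptions rather than details stated in the text.

```python
import hashlib
import math
from collections import Counter


def entropy(objects):
    """Shannon entropy (base 2, an assumed choice) of a list of comparison objects."""
    counts = Counter(objects)
    total = len(objects)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def disambiguate(pages, time_sensitive_titles):
    """Turn (title, body) pairs into comparison objects.

    A time-sensitive title is paired with a digest of its body content, so
    that the same title with different bodies counts as different objects.
    """
    result = []
    for title, body in pages:
        if title in time_sensitive_titles:
            digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
            result.append((title, digest))
        else:
            result.append((title, None))
    return result


pages = [("One-Week News Flash", "body of week 1"),
         ("One-Week News Flash", "body of week 2")]
titles_only = [title for title, _ in pages]
distinct = disambiguate(pages, {"One-Week News Flash"})
```

Counting titles alone gives a single comparison object with probability 1 and entropy 0; after disambiguation there are two objects, each with probability 1/2, and the entropy rises to 1 bit.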

Embodiment three

Fig. 3 is a flowchart of a website identification method provided by Embodiment 3 of the present invention. This embodiment is an optimization based on the above embodiments. In this embodiment, performing anomaly identification on the website to be verified according to the information entropy calculation result is specifically optimized as: obtaining, according to the data features of the website to be verified, a reference website associated with the website to be verified from a reliable website list; obtaining the information entropy of at least one content domain corresponding to the reference website; selecting at least one key content domain in the website to be verified and the reference website; calculating a difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website; and, if the difference factor satisfies a set threshold condition, determining that the website to be verified is an abnormal website.

Correspondingly, the method of this embodiment specifically includes:

310. Within a set time period, obtain at least two historical update pages associated with the website to be verified.

320. Perform content parsing on each historical update page to obtain at least one content domain corresponding to each historical update page.

330. Calculate the information entropy of each content domain according to the content changes in the same content domain across the historical update pages.

340. According to the data features of the website to be verified, obtain a reference website associated with the website to be verified from the reliable website list.

In this embodiment, the data features of the website to be verified may include at least one of the following: the website update frequency within a set time period, the number of newly added pages within the set time period, the content topic, and so on.

The reliable website list specifically refers to a batch of reliable websites determined by methods such as mining user behavior logs or manual curation.

In this embodiment, it is considered that reliable websites with similar update frequencies, similar numbers of newly added pages within the set time period, or similar content topics will also show a certain similarity in the information entropy of each content domain of their web pages. Therefore, by obtaining from the reliable website list a reference website whose data features are close to those of the website to be verified, and comparing the information entropy of each domain in the reference website with that in the website to be verified, abnormal websites can be identified.
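One possible way to pick a reference website from the reliable website list is sketched below. The text does not specify how similarity of data features is measured, so the `SiteFeatures` record, the same-topic filter and the relative-gap score are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SiteFeatures:
    # Hypothetical record of the data features named in the text.
    name: str
    update_frequency: float  # updates within the set time period
    new_page_count: int      # newly added pages within the set time period
    topic: str               # content topic label


def relative_gap(a, b):
    """Relative difference between two non-negative feature values."""
    largest = max(abs(a), abs(b))
    return 0.0 if largest == 0 else abs(a - b) / largest


def pick_reference_site(target, reliable_sites):
    """Return the reliable site whose features are closest to the target's."""
    same_topic = [s for s in reliable_sites if s.topic == target.topic]
    candidates = same_topic or list(reliable_sites)
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda s: relative_gap(s.update_frequency, target.update_frequency)
        + relative_gap(s.new_page_count, target.new_page_count),
    )
```

Preferring sites with the same topic before comparing numeric features is only one reasonable ordering; a weighted combination of all three features would serve equally well.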

350. Obtain the information entropy of at least one content domain corresponding to the reference website.

360. In the website to be verified and the reference website, select at least one key content domain.

Here, all the content domains included in both the website to be verified and the reference website may be taken as the key content domains, or one or more important content domains included in both (for example, the picture domain and the text title domain) may be taken as the key content domains; this embodiment places no limitation on this.

370. Calculate the difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website.

In a preferred implementation of this embodiment, calculating the difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website may specifically include:

In the website to be verified and the reference website, obtaining the information entropy difference corresponding to the same key content domain as the difference factor.

For example, in the website to be verified, the information entropy corresponding to key content domain 1 is A and the information entropy corresponding to key content domain 2 is B; in the reference website, the information entropy corresponding to key content domain 1 is C and the information entropy corresponding to key content domain 2 is D.

Then |A-C| and |B-D| can be used as the difference factors, where | | denotes the absolute value.
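The |A-C| and |B-D| example can be written as a small sketch. The function name and the dictionary keys are illustrative assumptions, and the numeric values are placeholders.

```python
def entropy_differences(site_entropy, reference_entropy, key_domains):
    """Absolute entropy difference per key content domain.

    site_entropy / reference_entropy: dicts mapping a content-domain name to
    its information entropy for the website to be verified and for the
    reference website respectively.
    """
    return {
        domain: abs(site_entropy[domain] - reference_entropy[domain])
        for domain in key_domains
    }


# A = 0.5, B = 3.0 for the website to be verified; C = 2.0, D = 2.5 for the reference.
differences = entropy_differences(
    {"domain_1": 0.5, "domain_2": 3.0},
    {"domain_1": 2.0, "domain_2": 2.5},
    ["domain_1", "domain_2"],
)
```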

In another preferred implementation of this embodiment, calculating the difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website may specifically include:

In the website to be verified, forming a first information vector from the information entropies corresponding to at least two key content domains;

In the reference website, forming a second information vector from the information entropies corresponding to the at least two key content domains;

Calculating the distance value between the first information vector and the second information vector as the difference factor.

Continuing the previous example, in the website to be verified, the information entropy corresponding to key content domain 1 is A and the information entropy corresponding to key content domain 2 is B; in the reference website, the information entropy corresponding to key content domain 1 is C and the information entropy corresponding to key content domain 2 is D.

Then the first information vector corresponding to the website to be verified is [A, B], and the second information vector corresponding to the reference website is [C, D].

The distance value between the two vectors can be calculated in various ways, typically by calculating the cosine angle between the two, and the calculated distance value is used as the difference factor.
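A minimal sketch of the vector-based difference factor follows. The text only says that the cosine angle is a typical choice; returning the angle in radians (so that a larger value means a larger difference) is an assumption made here, as is the function name.

```python
import math


def cosine_angle(first_vector, second_vector):
    """Angle in radians between two entropy vectors such as [A, B] and [C, D]."""
    dot = sum(a * b for a, b in zip(first_vector, second_vector))
    norm_first = math.sqrt(sum(a * a for a in first_vector))
    norm_second = math.sqrt(sum(b * b for b in second_vector))
    if norm_first == 0 or norm_second == 0:
        raise ValueError("the angle is undefined for an all-zero entropy vector")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, dot / (norm_first * norm_second)))
    return math.acos(cosine)
```

Because the angle ignores vector length, two websites whose entropies differ only by a common scale factor are treated as identical; a Euclidean distance would be an alternative when absolute entropy levels matter.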

380. Judge whether the difference factor satisfies the set threshold condition; if so, execute 390; otherwise, execute 3100.

If the difference factor is the information entropy difference, determining that the website to be verified is an abnormal website if the difference factor satisfies the set threshold condition may specifically include:

If a set number of information entropy differences exceed a set threshold, and/or the information entropy difference corresponding to a set key content domain exceeds a set threshold, determining that the website to be verified is an abnormal website; or

If the accumulated difference obtained by a weighted sum of at least two information entropy differences exceeds a set threshold, determining that the website to be verified is an abnormal website.

If the difference factor is the distance value, determining that the website to be verified is an abnormal website if the difference factor satisfies the set threshold condition may specifically include:

If the distance value exceeds a set threshold, determining that the website to be verified is an abnormal website.
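The two decision rules for information entropy differences can be sketched as below. The function names, the weights and the thresholds are placeholders chosen for illustration; the text leaves all of them to be set in practice.

```python
def exceeds_count_rule(differences, threshold, min_count):
    """True if at least min_count entropy differences exceed the threshold."""
    exceeding = sum(1 for value in differences.values() if value > threshold)
    return exceeding >= min_count


def exceeds_weighted_rule(differences, weights, threshold):
    """True if the weighted sum of the entropy differences exceeds the threshold."""
    accumulated = sum(weights[domain] * value
                      for domain, value in differences.items())
    return accumulated > threshold


differences = {"picture": 1.5, "text_title": 0.5}
weights = {"picture": 0.5, "text_title": 0.5}
```

With these placeholder values the weighted sum is 1.0, so the website would be flagged under a threshold of 0.9 but not under a threshold of 1.0.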

390. Determine that the website to be verified is an abnormal website.

3100. Determine that the website to be verified is a normal website.

In the technical solution of this embodiment, after the information entropy of each content domain in the website to be verified is calculated, the information entropy of each content domain in a reliable website whose data features are similar to those of the website to be verified is obtained, the difference factor between the two is calculated from their information entropies, and anomaly identification is then performed on the website to be verified. This makes it possible to identify abnormal websites simply and quickly from the information entropy difference between abnormal websites and reliable websites, with high identification accuracy and good real-time performance.

Embodiment four

Fig. 4 is a structural diagram of a website identification device provided by Embodiment 4 of the present invention. As shown in Fig. 4, the device includes a historical update page acquisition module 41, a content domain acquisition module 42, a content domain information entropy calculation module 43 and an anomaly identification module 44, wherein:

The historical update page acquisition module 41 is configured to obtain, within a set time period, at least two historical update pages associated with the website to be verified.

The content domain acquisition module 42 is configured to perform content parsing on each historical update page to obtain at least one content domain corresponding to each historical update page.

The content domain information entropy calculation module 43 is configured to calculate the information entropy of each content domain according to the content changes in the same content domain across the historical update pages.

The anomaly identification module 44 is configured to perform anomaly identification on the website to be verified according to the information entropy calculation result.

In this embodiment of the present invention, at least two historical update pages associated with the website to be verified are obtained within a set time period; content parsing is performed on each historical update page to obtain at least one content domain corresponding to each historical update page; the information entropy of each content domain is calculated according to the content changes in the same content domain across the historical update pages; and anomaly identification is performed on the website to be verified according to the information entropy calculation result. Because information entropy features discriminate well, are simple to compute and are timely, this solves the technical problems of existing cheating-website identification techniques, namely low discrimination, poor real-time performance and the need for additional manual labeling or data preparation work. It thereby optimizes existing website identification techniques and improves the accuracy of identifying abnormal websites.

On the basis of the above embodiments, the at least two historical update pages associated with the website to be verified may include:

At least two historical update pages corresponding to the website domain name of the website to be verified; and/or

At least two historical update pages corresponding to the same web page address in the website to be verified.

On the basis of the above embodiments, the historical update page acquisition module may specifically be configured to:

Within a set time period, crawl newly generated pages and/or updated pages in the network by means of a web crawler;

After clustering the crawled pages by website domain name, take the website corresponding to a cluster as the website to be verified;

Obtain, according to the pages included in the cluster, at least two historical update pages associated with the website to be verified.
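The clustering step can be sketched as grouping crawled page URLs by host name. Using the URL host name as a stand-in for the "website domain name", and requiring at least two pages per cluster, are assumptions of this sketch; the crawler itself is not shown.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cluster_by_domain(page_urls):
    """Group crawled page URLs by host name (assumed proxy for the site domain)."""
    clusters = defaultdict(list)
    for url in page_urls:
        host = urlsplit(url).hostname
        if host:
            clusters[host].append(url)
    return dict(clusters)


def sites_to_verify(page_urls, min_pages=2):
    """Keep only clusters holding at least min_pages new or updated pages."""
    return {
        host: urls
        for host, urls in cluster_by_domain(page_urls).items()
        if len(urls) >= min_pages
    }


crawled = [
    "http://example.com/a.html",
    "http://example.com/b.html",
    "http://example.org/index.html",
]
candidates = sites_to_verify(crawled)
```

Only websites that actually produced new or updated pages in the period become candidates, which is how the approach avoids scanning every indexed website.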

On the basis of the above embodiments, the content domain may include at least one of the following:

A text title domain, a picture domain, a picture title domain, and a picture description text domain.

On the basis of the above embodiments, the content domain information entropy calculation module may specifically be configured to:

Extract at least one comparison object from the same target content domain of each historical update page;

Calculate the occurrence probability of the comparison object according to the number of occurrences of the comparison object in the target content domain of each historical update page;

Calculate the information entropy corresponding to the target content domain according to the occurrence probability of the comparison object.
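The three calculation steps above can be sketched as follows, assuming the standard Shannon entropy with a base-2 logarithm (the logarithm base is an assumption of this sketch). Each page is represented as a dictionary from a content-domain name to one already extracted comparison object; that representation and the function names are illustrative.

```python
import math
from collections import Counter


def occurrence_probabilities(objects):
    """Occurrence probability of each comparison object in one content domain."""
    counts = Counter(objects)
    total = sum(counts.values())
    return {obj: count / total for obj, count in counts.items()}


def domain_entropies(pages):
    """Information entropy of every content domain across the update pages.

    pages: list of dicts mapping a content-domain name to the comparison
    object extracted from that domain of one historical update page.
    """
    by_domain = {}
    for page in pages:
        for domain, obj in page.items():
            by_domain.setdefault(domain, []).append(obj)
    return {
        domain: sum(-p * math.log2(p)
                    for p in occurrence_probabilities(objs).values())
        for domain, objs in by_domain.items()
    }


pages = [
    {"picture": "img_1", "text_title": "title_1"},
    {"picture": "img_1", "text_title": "title_2"},
]
result = domain_entropies(pages)
```

In this toy input the picture never changes (entropy 0) while the title changes on every update (entropy 1 bit), which is the pattern described earlier as typical of cheating websites.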

On the basis of the above embodiments, if the content in the target content domain includes text, the comparison object may include the original text, a semantic signature or a semantic category;

If the content in the target content domain includes a picture, the comparison object may include the original picture or a picture category.

On the basis of the above embodiments, the device may further include a body content association comparison module, configured to:

Before the occurrence probability of the comparison object is calculated according to the number of occurrences of the comparison object in each target content domain, if it is determined that the comparison object is a time-sensitive simple repeated text, obtain, in each historical update page, the body content associated with the comparison object;

If, in different historical update pages, the body content corresponding to the same target comparison object is not identical, mark the target comparison object as different comparison objects.

On the basis of the above embodiments, the anomaly identification module may specifically include:

A reference website acquisition unit, configured to obtain, according to the data features of the website to be verified, a reference website associated with the website to be verified from the reliable website list;

A reference website information entropy acquisition unit, configured to obtain the information entropy of at least one content domain corresponding to the reference website;

A key content domain selection unit, configured to select at least one key content domain in the website to be verified and the reference website;

A difference factor calculation unit, configured to calculate the difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website;

An abnormal website identification subunit, configured to determine that the website to be verified is an abnormal website if the difference factor satisfies the set threshold condition.

On the basis of the above embodiments, the difference factor calculation unit may specifically be configured to:

In the website to be verified and the reference website, obtain the information entropy difference corresponding to the same key content domain as the difference factor;

The abnormal website identification subunit may specifically be configured to:

Calculate the distance value between the first information vector and the second information vector as the difference factor;

The abnormal website identification subunit may specifically be configured to:

On the basis of the above embodiments, the data features of the website to be verified may include at least one of the following:

The website update frequency within a set time period, the number of newly added pages within the set time period, and the content topic.

Taking the information entropy calculation result as at least one information entropy feature value, combining the information entropy feature value with other abnormal website identification feature values, and performing anomaly identification on the website to be verified.

The website identification device provided by the embodiments of the present invention can be used to execute the website identification method provided by any embodiment of the present invention, has the corresponding functional modules, and achieves the same beneficial effects.

Obviously, those skilled in the art will understand that the modules or steps of the present invention described above can be implemented by the server described above. Optionally, the embodiments of the present invention can be implemented as a program executable by a computer device, so that the program can be stored in a storage device and executed by a processor; the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like. Alternatively, the modules or steps may each be fabricated as a separate integrated circuit module, or several of them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A website identification method, characterized by comprising:

Performing content parsing on each historical update page to obtain at least one content domain corresponding to each historical update page;

Calculating the information entropy of each content domain according to the content changes in the same content domain across the historical update pages;

Performing anomaly identification on the website to be verified according to the information entropy calculation result;

Wherein calculating the information entropy of each content domain according to the content changes in the same content domain across the historical update pages comprises:

Calculating the occurrence probability of a comparison object according to the number of occurrences of the comparison object in the target content domain of each historical update page;

2. The method according to claim 1, characterized in that the at least two historical update pages associated with the website to be verified comprise:

3. The method according to claim 1 or 2, characterized in that obtaining, within a set time period, at least two historical update pages associated with the website to be verified comprises:

After clustering the crawled pages by website domain name, taking the website corresponding to a cluster as the website to be verified;

Obtaining, according to the pages included in the cluster, at least two historical update pages associated with the website to be verified.

4. The method according to claim 1, characterized in that the content domain comprises at least one of the following:

5. The method according to claim 1, characterized in that:

If the content in the target content domain includes text, the comparison object comprises the original text, a semantic signature or a semantic category;

If the content in the target content domain includes a picture, the comparison object comprises the original picture or a picture category.

6. The method according to claim 1 or 5, characterized in that, before calculating the occurrence probability of the comparison object according to the number of occurrences of the comparison object in each target content domain, the method further comprises:

If it is determined that the comparison object is a time-sensitive simple repeated text, obtaining, in each historical update page, the body content associated with the comparison object;

If, in different historical update pages, the body content corresponding to the same target comparison object is not identical, marking the target comparison object as different comparison objects.

7. The method according to claim 1, characterized in that performing anomaly identification on the website to be verified according to the information entropy calculation result comprises:

Obtaining, according to the data features of the website to be verified, a reference website associated with the website to be verified from a reliable website list;

Obtaining the information entropy of at least one content domain corresponding to the reference website;

Selecting at least one key content domain in the website to be verified and the reference website;

Calculating a difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website;

If the difference factor satisfies a set threshold condition, determining that the website to be verified is an abnormal website.

8. The method according to claim 7, characterized in that calculating the difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website specifically comprises:

In the website to be verified and the reference website, obtaining the information entropy difference corresponding to the same key content domain as the difference factor;

Determining that the website to be verified is an abnormal website if the difference factor satisfies the set threshold condition specifically comprises:

If a set number of information entropy differences exceed a set threshold, and/or the information entropy difference corresponding to a set key content domain exceeds a set threshold, determining that the website to be verified is an abnormal website; or

If the accumulated difference obtained by a weighted sum of at least two information entropy differences exceeds a set threshold, determining that the website to be verified is an abnormal website.

9. The method according to claim 7, characterized in that calculating the difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website specifically comprises:

In the website to be verified, forming a first information vector from the information entropies corresponding to at least two key content domains;

In the reference website, forming a second information vector from the information entropies corresponding to the at least two key content domains;

10. The method according to any one of claims 7 to 9, characterized in that the data features of the website to be verified comprise at least one of the following:

11. The method according to claim 1, characterized in that performing anomaly identification on the website to be verified according to the information entropy calculation result comprises:

If the sum of the information entropies of the content domains corresponding to the website to be verified is less than a set first threshold, determining that the website to be verified is an abnormal website; or

If the information entropy of at least one target content domain corresponding to the website to be verified is less than a set second threshold, determining that the website to be verified is an abnormal website; or

If the ratio of the information entropies of at least two target content domains corresponding to the website to be verified is less than a set third threshold, determining that the website to be verified is an abnormal website.

12. The method according to claim 1, characterized in that performing anomaly identification on the website to be verified according to the information entropy calculation result comprises:

Taking the information entropy calculation result as at least one information entropy feature value, combining the information entropy feature value with other abnormal website identification feature values, and performing anomaly identification on the website to be verified.

13. A website identification device, characterized by comprising:

A historical update page acquisition module, configured to obtain, within a set time period, at least two historical update pages associated with the website to be verified;

A content domain acquisition module, configured to perform content parsing on each historical update page to obtain at least one content domain corresponding to each historical update page;

A content domain information entropy calculation module, configured to calculate the information entropy of each content domain according to the content changes in the same content domain across the historical update pages;

An anomaly identification module, configured to perform anomaly identification on the website to be verified according to the information entropy calculation result;

Wherein the content domain information entropy calculation module is specifically configured to:

14. The device according to claim 13, characterized in that the historical update page acquisition module is specifically configured to:

15. The device according to claim 13, characterized by further comprising a body content association comparison module, configured to:

Before the occurrence probability of the comparison object is calculated according to the number of occurrences of the comparison object in each target content domain, if it is determined that the comparison object is a time-sensitive simple repeated text, obtain, in each historical update page, the body content associated with the comparison object;

16. The device according to claim 13, characterized in that the anomaly identification module specifically comprises:

A reference website acquisition unit, configured to obtain, according to the data features of the website to be verified, a reference website associated with the website to be verified from a reliable website list;

A reference website information entropy acquisition unit, configured to obtain the information entropy of at least one content domain corresponding to the reference website;

A key content domain selection unit, configured to select at least one key content domain in the website to be verified and the reference website;

A difference factor calculation unit, configured to calculate the difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website;

An abnormal website identification subunit, configured to determine that the website to be verified is an abnormal website if the difference factor satisfies the set threshold condition.

17. The device according to claim 13, characterized in that the anomaly identification module specifically comprises:

18. The device according to claim 13, characterized in that the anomaly identification module specifically comprises: