CN110263283A - Website detection method and device - Google Patents

Website detection method and device

Info

Publication number
CN110263283A
Authority
CN
China
Prior art keywords
website
webpage
seo
rule
abnormal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910531749.1A
Other languages
Chinese (zh)
Inventor
周坤朋
秦曼
韩佑波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZHENGZHOU XIZHI INFORMATION TECHNOLOGY Co Ltd
Original Assignee
ZHENGZHOU XIZHI INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZHENGZHOU XIZHI INFORMATION TECHNOLOGY Co Ltd
Priority to CN201910531749.1A
Publication of CN110263283A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566: URL specific, e.g. using aliases, detecting broken or misspelled links
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

This application discloses a website detection method and device. The method comprises: obtaining the website address of a website to be detected; successively crawling, according to the website address, the source code of each web page included in the website; performing anomaly detection on the source code of each web page in the website according to a plurality of preset search engine optimization (SEO) rules to obtain an anomaly detection result for the website, the result including the abnormal web pages in the website that do not comply with the SEO rules and, for each abnormal web page, the abnormal cause for which it does not comply; and outputting the anomaly detection result. The scheme of this application enables faster, more comprehensive, and more efficient SEO rule detection for a website.

Description

Website detection method and device
Technical field
This application relates to the technical field of website construction, and in particular to a website detection method and device.
Background
A website is a collection of related web pages, built with tools according to certain rules on the Internet, that is used to present specific content. People can use a website to publish information they want to make public, or to provide related network services.
Building a website requires following certain rules; a relatively common set is search engine optimization (Search Engine Optimization, SEO) rules. Adjusting a website according to SEO rules helps improve the website's ranking in the relevant search engines. However, many websites do not follow SEO rules well while they are being built, which leaves the website with problems that need improvement. How to detect the problems in a website that do not comply with SEO rules more comprehensively and efficiently is therefore a technical problem that those skilled in the art urgently need to solve.
Summary of the invention
In view of this, this application provides a website detection method and device, so as to perform SEO rule detection on a website more quickly, comprehensively, and efficiently.
To achieve the above object, in one aspect, this application provides a website detection method, comprising:
obtaining the website address of a website to be detected;
successively crawling, according to the website address, the source code of each web page included in the website;
performing anomaly detection on the source code of each web page in the website according to a plurality of preset search engine optimization (SEO) rules to obtain an anomaly detection result for the website, the anomaly detection result including the abnormal web pages in the website that do not comply with the SEO rules and the abnormal cause for which each abnormal web page does not comply with the SEO rules;
outputting the anomaly detection result.
Preferably, the plurality of SEO rules includes: at least one first SEO rule applicable within a web page, and at least one second SEO rule applicable between different web pages;
the performing anomaly detection on the source code of each web page in the website according to the plurality of preset search engine optimization (SEO) rules comprises:
performing anomaly detection on the source code of each web page in the website separately, according to the at least one first SEO rule applicable within a web page, to obtain an anomaly detection result for each web page in the website;
performing anomaly detection between different web pages in the website, according to the at least one second SEO rule applicable between different web pages, to obtain at least one abnormal web page group in the website in which an abnormality exists between web pages, together with the abnormal cause of the abnormal web page group, the abnormal web page group including at least two abnormal web pages.
Preferably, the at least one second SEO rule includes a duplicate page detection rule;
the performing anomaly detection between different web pages in the website according to the at least one second SEO rule applicable between different web pages comprises:
in response to the duplicate page detection rule, extracting the text data of each web page of the website separately;
for each web page in the website, computing a locality-sensitive fingerprint of the web page based on the text data of the web page;
for each web page in the website, computing, according to the locality-sensitive fingerprints of the web pages in the website, the Hamming distance between the web page and each other web page in the website, determining at least one web page whose Hamming distance to the web page is less than a set threshold, and determining the web page and the at least one web page as one abnormal web page group with duplicated content.
Preferably, the successively crawling, according to the website address, the source code of each web page included in the website comprises:
crawling the source code of the homepage of the website according to the website address;
extracting at least one link included in the source code of the homepage, and caching the extracted links in a link set;
for each unprocessed link in the link set, fetching the source code of the web page in the website according to the link;
extracting the links included in the source code of the web page, and caching the extracted links in the link set;
if an unprocessed link still exists in the link set, returning to the operation of fetching, for each link, the source code of the web page in the website according to the link, until no unprocessed link remains in the link set, thereby obtaining the source code of each web page included in the website.
Preferably, the fetching, for each link in the link set, the source code of the web page in the website according to the link comprises:
determining, from the link set, a target link currently to be processed;
determining, from a distributed crawler, a target crawler suited to process the target link;
fetching, by the target crawler, the source code of the web page that the target link points to.
Preferably, before the outputting the anomaly detection result, the method further comprises:
determining an optimization scheme for the abnormal web pages in the website according to the search engine optimization (SEO) rules and the abnormal cause for which each abnormal web page does not comply with the SEO rules;
and, while outputting the anomaly detection result, further outputting the optimization scheme for the abnormal web pages in the website.
Preferably, the obtaining the website address of a website to be detected comprises:
obtaining the domain name of the website to be detected as input by a user;
determining the uniform resource locator (URL) of the website based on the domain name of the website.
In another aspect, this application also provides a website detection device, comprising:
an address acquisition unit, configured to obtain the website address of a website to be detected;
a code crawling unit, configured to successively crawl, according to the website address, the source code of each web page included in the website;
an anomaly detection unit, configured to perform anomaly detection on the source code of each web page in the website according to a plurality of preset search engine optimization (SEO) rules to obtain an anomaly detection result for the website, the anomaly detection result including the abnormal web pages in the website that do not comply with the SEO rules and the abnormal cause for which each abnormal web page does not comply with the SEO rules;
a result output unit, configured to output the anomaly detection result.
Preferably, the plurality of SEO rules used for anomaly detection includes: at least one first SEO rule applicable within a web page, and at least one second SEO rule applicable between different web pages;
the anomaly detection unit comprises:
a first anomaly detection unit, configured to perform anomaly detection on the source code of each web page in the website separately, according to the at least one first SEO rule applicable within a web page, to obtain an anomaly detection result for each web page in the website;
a second anomaly detection unit, configured to perform anomaly detection between different web pages in the website, according to the at least one second SEO rule applicable between different web pages, to obtain at least one abnormal web page group in the website in which an abnormality exists between web pages, together with the abnormal cause of the abnormal web page group, the abnormal web page group including at least two abnormal web pages.
Preferably, the at least one second SEO rule includes a duplicate page detection rule;
the second anomaly detection unit comprises:
a text extraction unit, configured to extract, in response to the duplicate page detection rule, the text data of each web page of the website separately;
a fingerprint calculation unit, configured to compute, for each web page in the website, a locality-sensitive fingerprint of the web page based on the text data of the web page;
a duplicate detection unit, configured to compute, for each web page in the website and according to the locality-sensitive fingerprints of the web pages in the website, the Hamming distance between the web page and each other web page in the website, determine at least one web page whose Hamming distance to the web page is less than a set threshold, and determine the web page and the at least one web page as one abnormal web page group with duplicated content.
As can be seen from the above technical scheme, in the embodiments of this application, after the website address of the website to be detected is obtained, the source code of all web pages included in the website can be crawled in turn, and anomaly detection is performed on each web page in the website against the plurality of preset SEO rules. The application can thus perform anomaly detection on all web pages of the website in a single run, detect the abnormal web pages that do not comply with the SEO rules together with their abnormal causes, and prompt the user. A single detection of the website therefore covers every web page in it and detects non-compliance with SEO rules more comprehensively.
Moreover, unlike approaches that check only one SEO rule per run, this application can perform anomaly detection against the plurality of preset SEO rules when checking each web page in the website. A single detection of the website thus covers multiple SEO rules, avoids submitting the website for detection repeatedly in order to check multiple SEO rules, and helps detect non-compliance with SEO rules more comprehensively and efficiently.
Brief description of the drawings
To describe the technical schemes in the embodiments of this application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below are merely embodiments of this application, and those of ordinary skill in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 shows a flow diagram of a website detection method of this application;
Fig. 2 shows another flow diagram of a website detection method of this application;
Fig. 3 shows yet another flow diagram of a website detection method of this application;
Fig. 4 shows a schematic diagram of the composition of a website detection device of this application.
Detailed description
The scheme of this application is suitable for performing SEO rule detection on a website that has been built, so as to detect the parts of the website that do not comply with SEO rules and provide a basis for reasonably modifying or optimizing the website.
A detailed description is given below with reference to the flow diagrams.
Fig. 1 shows a flow diagram of a website detection method of this application. The method of this embodiment can be applied to a website detection platform used for detecting websites. The website detection platform can be one computer device or a set of multiple computer devices, for example a distributed computing system or distributed server comprising multiple servers. The method of this embodiment may include:
S101: Obtain the website address of the website to be detected.
The website to be detected is the website that currently needs SEO rule detection. The website address of the website points to a web page of the website.
It can be understood that a user can submit information such as the website address to be detected to the website detection platform as needed.
Optionally, considering that a domain name is easier for a user to remember, in the embodiments of this application the user can submit the domain name of the website to be detected to the website detection platform when a website needs to be detected. Correspondingly, the website detection platform can obtain the domain name of the website to be detected as input by the user and, based on that domain name, determine the uniform resource locator (Uniform Resource Locator, URL) of the website.
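As an illustration only, the conversion from a user-supplied domain name to a site URL might look like the following minimal sketch. The function name `domain_to_url` and the default `https` scheme are assumptions made for this example; the application does not specify how the URL is derived.

```python
from urllib.parse import urlsplit


def domain_to_url(user_input: str, default_scheme: str = "https") -> str:
    """Normalize a user-supplied domain name into the root URL of the site.

    Hypothetical helper: the default scheme is an assumption, not part of
    the described method.
    """
    text = user_input.strip()
    # Prepend a scheme when the user typed only a bare domain name.
    if "://" not in text:
        text = f"{default_scheme}://{text}"
    parts = urlsplit(text)
    if not parts.netloc:
        raise ValueError(f"not a usable domain name: {user_input!r}")
    # Keep only scheme and host so the URL points at the homepage.
    return f"{parts.scheme}://{parts.netloc.lower()}/"
```

For example, `domain_to_url("Example.com")` yields `https://example.com/`, which can then serve as the starting address for crawling.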
S102: According to the website address, successively crawl the source code of each web page included in the website.
A website generally consists of multiple web pages. To perform comprehensive SEO rule detection on the website, the source code of all web pages included in the website needs to be crawled.
It can be understood that link relationships can exist between the web pages of a website. For example, the homepage of the website may include one or more links (also called link addresses) to next-level web pages, and a next-level web page reached from the homepage may in turn link to further next-level web pages, and so on. Correspondingly, based on the website address, this application can successively crawl the source code of each web page in the website, analyze the links in the source code of each web page, and finally crawl the source code of all web pages in the website through continued iteration.
For example, in one possible implementation, the source code of the homepage of the website is crawled according to the website address, at least one link included in the source code of the homepage is extracted, and the extracted links are cached in a link set. Then, for each unprocessed link in the link set, the source code of the web page in the website is fetched according to the link, and the links included in the source code of that web page are extracted. Any links extracted from the source code of the web page are cached in the link set. If unprocessed links still exist in the link set, the operation of fetching the source code of the web page in the website according to the link continues for each unprocessed link, until no unprocessed link remains in the link set, thereby obtaining the source code of each web page included in the website.
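The iteration over the link set described above can be sketched roughly as follows. This is a single-process illustration under stated assumptions: the names `extract_links` and `crawl_site` are hypothetical, the `fetch` callable (which returns a page's source code or `None`) is supplied by the caller, and only links on the same host as the homepage are followed, which the application does not state explicitly.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page's source code."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free links that stay on the same host."""
    parser = LinkExtractor()
    parser.feed(html)
    host = urlsplit(base_url).netloc
    result = []
    for href in parser.links:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlsplit(absolute).netloc == host:  # assumption: same-site only
            result.append(absolute)
    return result


def crawl_site(home_url, fetch):
    """Crawl from the homepage until the link set has no unprocessed link."""
    pending = deque([home_url])  # links not yet processed
    seen = {home_url}            # every link ever placed in the link set
    pages = {}                   # URL -> cached source code
    while pending:
        url = pending.popleft()
        html = fetch(url)
        if html is None:         # page could not be fetched; skip it
            continue
        pages[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return pages
```

Passing `fetch` in as a parameter keeps the loop independent of any particular HTTP client, so the same loop can be exercised against an in-memory dictionary of pages.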
It can be understood that, because a website generally contains many web pages, each time the source code of one web page is crawled, that source code can be cached while the source code of other web pages continues to be crawled.
In this application, a crawler can be used to crawl the source code of web pages and to extract the links present in the source code. A crawler is a program or script that automatically fetches web information according to certain rules. Optionally, to improve the efficiency of crawling source code, this application can use a distributed crawler to crawl the source code of each web page in the website.
S103: According to a plurality of preset search engine optimization (SEO) rules, perform anomaly detection on the source code of each web page in the website to obtain an anomaly detection result for the website.
SEO rules are also called SEO site-building rules.
The plurality of preset SEO rules can be set as needed. For example, the plurality of SEO rules may include rules that check whether the naming or format of tags in the source code of a web page meets requirements, SEO rules that the source code must satisfy regarding web page content and layout, and so on.
It can be understood that, compared with configuring a single SEO rule at a time, the website detection platform of this application is preset with a plurality of SEO rules. Therefore, after a user initiates SEO detection for a website, the website detection platform can check the website against the plurality of SEO rules in a single run, without the user having to go through cumbersome operations to submit the website repeatedly for detection against different SEO rules, which improves the convenience and efficiency of website detection.
In the embodiments of this application, the anomaly detection result may include the abnormal web pages in the website that do not comply with the SEO rules and the abnormal cause for which each abnormal web page does not comply. For example, if the naming of a web page tag in some web page of the website does not comply with the SEO rule concerning web page tag naming, that web page is an abnormal web page, and the abnormal cause is that the web page tag does not meet requirements.
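A rough sketch of checking every page against several preset rules in one pass, and collecting abnormal pages with their causes, is shown below. The three rules (missing title, missing meta description, image without `alt` text) and their reason codes are assumptions chosen for illustration; the application does not enumerate concrete SEO rules.

```python
from html.parser import HTMLParser


class PageFacts(HTMLParser):
    """Gathers the facts about one page that the example rules inspect."""

    def __init__(self):
        super().__init__()
        self.title_count = 0
        self.has_meta_description = False
        self.images_without_alt = 0

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self.title_count += 1
        elif tag == "meta" and (attr.get("name") or "").lower() == "description":
            if (attr.get("content") or "").strip():
                self.has_meta_description = True
        elif tag == "img" and not (attr.get("alt") or "").strip():
            self.images_without_alt += 1


# Hypothetical preset rules: (abnormal cause, predicate that is True on violation).
RULES = [
    ("missing-title", lambda f: f.title_count == 0),
    ("missing-meta-description", lambda f: not f.has_meta_description),
    ("image-without-alt", lambda f: f.images_without_alt > 0),
]


def check_page(html):
    """Return the abnormal causes for one page (empty list if compliant)."""
    facts = PageFacts()
    facts.feed(html)
    return [reason for reason, violated in RULES if violated(facts)]


def check_site(pages):
    """Map each abnormal page URL to its list of abnormal causes."""
    result = {}
    for url, html in pages.items():
        reasons = check_page(html)
        if reasons:
            result[url] = reasons
    return result
```

Keeping the rules in a list means additional rules can be added without changing the detection loop, which mirrors the idea of checking many rules in a single submission.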
S104: Output the anomaly detection result.
Outputting the anomaly detection result prompts the user about the places in the website that do not comply with the SEO rules, which helps the user improve or optimize the website in time.
Optionally, before step S104, this application can also determine an optimization scheme for the abnormal web pages in the website by combining the SEO rules with the abnormal cause for which each abnormal web page does not comply with the SEO rules. For example, if the reason a web page does not comply with the SEO rules is that the format of a web page tag does not meet requirements, then, according to the requirement in the SEO rules for web page tag format, the optimization scheme given can be to adjust the web page tag format, together with the format that the web page tag must satisfy (which can be obtained from the SEO rules).
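One simple way to derive an optimization scheme from the abnormal causes is a lookup from each cause to a suggested fix, as in the sketch below. The reason codes and the suggestion texts are hypothetical and exist only for this example; in the described method the required format would come from the SEO rules themselves.

```python
# Hypothetical mapping from abnormal cause to suggested fix.
SUGGESTIONS = {
    "missing-title": "Add a single, descriptive <title> element.",
    "missing-meta-description": "Add a <meta name=\"description\"> tag with a short summary.",
    "image-without-alt": "Give every <img> element a meaningful alt attribute.",
}


def optimization_plan(detection_result):
    """Turn {url: [abnormal cause, ...]} into {url: [suggested fix, ...]}."""
    plan = {}
    for url, reasons in detection_result.items():
        plan[url] = [
            # Fall back to a generic hint for causes without a stored suggestion.
            SUGGESTIONS.get(reason, f"Review the page against the rule: {reason}")
            for reason in reasons
        ]
    return plan
```

The plan can then be output alongside the detection result so each abnormal page is reported together with its suggested changes.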
Correspondingly, the optimization scheme for the abnormal web pages in the website can be output together with the anomaly detection result, so that the user can determine how to improve the web pages more efficiently.
As can be seen from the scheme of this application, after the website address of the website to be detected is obtained, the source code of all web pages included in the website can be crawled in turn, and anomaly detection is performed on each web page in the website against the plurality of preset SEO rules. The application can thus perform anomaly detection on all web pages of the website in a single run, detect the abnormal web pages that do not comply with the SEO rules together with their abnormal causes, and prompt the user. A single detection of the website therefore covers every web page in it and detects non-compliance with SEO rules more comprehensively.
Moreover, unlike approaches that check only one SEO rule per run, this application can perform anomaly detection against the plurality of preset SEO rules when checking each web page in the website. A single detection of the website thus covers multiple SEO rules, avoids submitting the website for detection repeatedly in order to check multiple SEO rules, and helps detect non-compliance with SEO rules more comprehensively and quickly.
It can be understood that SEO rules include not only detection rules for a single web page but may also include detection rules that apply between multiple web pages. Therefore, the plurality of SEO rules of this application may include at least one first SEO rule applicable within a web page and at least one second SEO rule applicable between different web pages. For ease of distinction, an SEO rule applicable within a single web page is called a first SEO rule, and an SEO rule applicable to detection between different web pages is called a second SEO rule.
Correspondingly, when anomaly detection is performed on the source code of each web page in the website, anomaly detection can be performed on the source code of each web page separately according to the at least one first SEO rule, to obtain the anomaly detection result of each web page in the website. That is, each web page is checked individually against the first SEO rules to analyze whether the web page itself fails to comply with the SEO rules specified for the format, layout, and so on of the web page itself, and to determine the abnormal web pages and their abnormal causes.
In addition, this application can perform anomaly detection between different web pages in the website according to the at least one second SEO rule applicable between different web pages, to obtain at least one abnormal web page group in which an abnormality exists between web pages, together with the abnormal cause of the group. An abnormal web page group includes at least two abnormal web pages.
A second SEO rule is a detection rule that applies between two or more web pages and is used to check whether those web pages satisfy the second SEO rule with respect to one another. If two or more web pages do not satisfy the second SEO rule, those web pages can be determined as one abnormal web page group. For example, if web page 1 and web page 2 do not comply with some second SEO rule, web page 1 and web page 2 belong to one abnormal web page group; if web page 3 and web page 4 also do not comply with some second SEO rule, web page 3 and web page 4 belong to another abnormal web page group.
Optionally, the at least one second SEO rule may include a duplicate page detection rule, which is used to detect whether the degree of content duplication between two or more web pages exceeds a threshold. If it does, the duplication between those web pages is considered too high and not compliant with the SEO rules.
For ease of understanding, the website detection method of this application is introduced below taking the case where the second SEO rule is a duplicate page detection rule as an example. Fig. 2 shows another flow diagram of a website detection method of this application. The method of this embodiment may include:
S201: Obtain the domain name of the website to be detected as submitted by the user.
S202: Based on the domain name of the website, determine the URL of the website.
This embodiment is described using one implementation of obtaining the website address of the website to be detected as an example; other ways of obtaining it are equally applicable.
S203: According to the website address, successively crawl the source code of each web page included in the website.
S204: According to at least one first SEO rule, perform anomaly detection on the source code of each web page in the website separately to obtain the anomaly detection result of each web page in the website.
The anomaly detection result may include the abnormal web pages in which an abnormality exists and the abnormal cause of each abnormal web page.
S205: Based on the duplicate page detection rule among the at least one second SEO rule, extract the text data of each web page of the website separately.
The text data of a web page can be extracted from the source code of the web page; the specific extraction method is not limited.
Optionally, this step can instead extract the core text data belonging to the designated core content of the web page.
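Since the extraction method is left open, the following is just one possible sketch: it keeps all visible text and drops the contents of `<script>` and `<style>` elements. The names `TextExtractor` and `extract_text` are hypothetical, and no attempt is made here to isolate the core content of a page.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Normalize runs of whitespace and ignore whitespace-only chunks.
        words = data.split()
        if self._skip_depth == 0 and words:
            self.chunks.append(" ".join(words))


def extract_text(html):
    """Return the visible text of a page as one space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The resulting string is what a later fingerprinting step would consume.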
S206: For each web page in the website, compute the locality-sensitive fingerprint of the web page based on the text data of the web page.
The locality-sensitive fingerprint of a web page is what is commonly called a simhash fingerprint, also known as a locality-sensitive hash.
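A minimal simhash sketch is given below to make the fingerprint concrete. Several choices are assumptions for this example and are not specified by the application: tokens are taken with the regular expression `\w+` (Chinese text would normally need a word segmenter instead), token frequency is used as the weight, BLAKE2b supplies the per-token hash, and the fingerprint is 64 bits wide.

```python
import hashlib
import re
from collections import Counter


def simhash(text, bits=64):
    """Compute a simhash fingerprint of the given text as an integer."""
    # Assumption: word tokens weighted by how often they occur.
    weights = Counter(re.findall(r"\w+", text.lower()))
    totals = [0] * bits
    for token, weight in weights.items():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=bits // 8).digest()
        token_hash = int.from_bytes(digest, "big")
        for i in range(bits):
            # Each token votes +weight or -weight on every bit position.
            totals[i] += weight if (token_hash >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:  # a positive total sets that bit of the fingerprint
            fingerprint |= 1 << i
    return fingerprint
```

Because every token contributes a small vote to each bit, texts that share most of their tokens tend to produce fingerprints that differ in only a few bit positions, which is what makes a bitwise distance meaningful as a duplication measure.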
S207: For each web page in the website, compute, according to the locality-sensitive fingerprints of the web pages in the website, the Hamming distance between the web page and each other web page in the website, determine at least one web page whose Hamming distance to the web page is less than a set threshold, and determine the web page and the at least one web page as one abnormal web page group with duplicated content, thereby obtaining at least one abnormal web page group with duplicated content.
It can be understood that, for any two web pages, if the Hamming distance computed from their respective locality-sensitive fingerprints is less than the set threshold, the degree of content duplication between the two web pages is high, and the duplication between them is considered not compliant with the SEO rules.
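The comparison of two fingerprints can be illustrated as follows. The function names and the default threshold of 3 are assumptions for this sketch; the application only says that a set threshold is used.

```python
def hamming_distance(fp_a, fp_b):
    """Count the bit positions in which two integer fingerprints differ."""
    return bin(fp_a ^ fp_b).count("1")


def is_near_duplicate(fp_a, fp_b, threshold=3):
    """True when the Hamming distance is less than the set threshold."""
    # The threshold value is an assumption chosen for illustration.
    return hamming_distance(fp_a, fp_b) < threshold
```

For instance, the fingerprints `0b1011` and `0b0010` differ in two bit positions, so with the assumed threshold of 3 they would be treated as duplicated content.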
On this basis, each web page in the website needs to be compared with the other web pages in the website to find the two or more web pages whose mutual duplication is high; those two or more web pages form one abnormal web page group.
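The pairwise comparison and grouping can be sketched as below. This example compares every pair of pages and merges overlapping matches with a small union-find structure, so that pages connected through a chain of near-duplicates end up in one group; that merging behaviour, the function name `duplicate_groups`, and the threshold of 3 are assumptions rather than details stated in the application.

```python
from itertools import combinations


def duplicate_groups(fingerprints, threshold=3):
    """Group page URLs whose fingerprints are within the Hamming threshold.

    `fingerprints` maps URL -> integer fingerprint. Returns a list of groups,
    each a sorted list of at least two URLs.
    """
    parent = {url: url for url in fingerprints}

    def find(url):
        # Follow parent pointers to the group representative (with path halving).
        while parent[url] != url:
            parent[url] = parent[parent[url]]
            url = parent[url]
        return url

    for a, b in combinations(fingerprints, 2):
        distance = bin(fingerprints[a] ^ fingerprints[b]).count("1")
        if distance < threshold:
            parent[find(a)] = find(b)  # merge the two pages' groups

    groups = {}
    for url in fingerprints:
        groups.setdefault(find(url), []).append(url)
    # Only groups with two or more pages count as abnormal web page groups.
    return [sorted(group) for group in groups.values() if len(group) >= 2]
```

Comparing all pairs costs time quadratic in the number of pages, which is acceptable for a sketch; a large site would likely need an index over fingerprint segments instead.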
S208: Output the anomaly detection result of the web pages in the website and the information of the at least one abnormal web page group with duplicated content.
It can be understood that, in the embodiments of this application, the source code of the other web pages pointed to by the links present in the source code of each web page can be fetched through continued iteration, and a distributed crawler can be used in this process to improve crawling efficiency. For ease of understanding, refer to Fig. 3, which shows yet another flow diagram of a website detection method of this application. The method of this embodiment may include:
S301: Obtain the domain name of the website to be detected as submitted by the user.
S302: Based on the domain name of the website, determine the URL of the website.
This embodiment is described using one implementation of obtaining the website address of the website to be detected as an example; other ways of obtaining it are equally applicable.
S303: According to the URL of the website, crawl the source code of the homepage of the website.
It can be understood that the URL of the website points to the homepage of the website, so the source code of the homepage can be fetched first based on that URL.
S304: Extract at least one link included in the source code of the homepage, and cache the extracted links in a link set.
So that the links in the source code of each web page can be stored in turn through iteration, the links extracted from each web page of the website are cached in one link set.
S305: Determine, from the link set, the target link currently to be processed.
For ease of distinction, a link in the link set that has not yet been crawled by a crawler is called a target link.
S306: Determine, from the distributed crawler, the target crawler suited to process the target link.
It can be understood that the distributed crawler can consist of multiple crawler modules, and links to be crawled can be assigned to the crawler modules of the distributed crawler according to the load balancing principle. On this basis, the target crawler that is to process the target link can be determined.
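As an illustration of assigning links by load, the sketch below always picks the crawler module with the fewest links currently in progress. The class `CrawlerPool`, its methods, and the least-loaded policy are assumptions for this example; the application only states that a load balancing principle is applied.

```python
class CrawlerPool:
    """Tracks in-progress work per crawler module and assigns target links."""

    def __init__(self, crawler_names):
        # Number of links each crawler module is currently processing.
        self.load = {name: 0 for name in crawler_names}

    def assign(self, link):
        """Choose the least-loaded crawler for the link (ties broken by name)."""
        target = min(self.load, key=lambda name: (self.load[name], name))
        self.load[target] += 1
        return target

    def done(self, crawler_name):
        """Record that the crawler finished one link."""
        self.load[crawler_name] -= 1
```

In a real deployment the crawler modules would run on separate machines and report completion over the network; the counters here only stand in for that bookkeeping.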
S307 by the source code of webpage pointed by the target crawler capturing Object linking, and source code is stored To code memory block.
Wherein, by source code storage code memory block be can selection operation, its purpose is to it is subsequent convenient for centralized management The respective source code of all webpages of the website grabbed, to carry out SEO rule detection to webpage one by one.
S308 extracts the link for including in the source code of the webpage, and the link extracted is cached in link set.
It is linked it is understood that being based on one in crawler, after the source code for crawling webpage pointed by the link, Due to that still may include the chained address for linking other webpages in the source code of the webpage, it is still necessary to from the webpage Link is extracted in source code.Meanwhile in order to crawl out the source code of each layer webpage of website, in extracting source code Storage is also required to after link into link set, to continue the source generation for analysing whether to have the webpage being not yet crawled out Code.
S309, detect whether the link set contains a link that has not yet been processed; if so, return to step S305; if not, obtain the source code of each webpage of the website stored in the code storage area and execute step S310.
It can be understood that if every link in the link set has been assigned to a crawler and crawled, the website contains no link address of a webpage that has not yet been crawled. In that case, it can be confirmed that the code storage area stores the source code of all webpages of the website.
S310, perform abnormality detection on the source code of each webpage in the website according to multiple preset search engine optimization (SEO) rules, to obtain the abnormality detection result of the website.
S311, output the abnormality detection result.
For steps S310 and S311, refer to the related description in the preceding embodiments; details are not repeated here.
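The crawling loop of steps S304 to S309 can be sketched roughly as follows. This is a minimal single-process illustration, not the implementation described in the application: the names `crawl_site`, `extract_links` and `LinkExtractor` are hypothetical, the page download is delegated to a caller-supplied `fetch` function, and restricting links to the same host is an assumption made here to stay within "the website".

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page's source code."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(page_url, source):
    """Return absolute links on the same host (assumption: same host = same website)."""
    parser = LinkExtractor()
    parser.feed(source)
    site = urlparse(page_url).netloc
    absolute = (urljoin(page_url, href) for href in parser.links)
    return [url for url in absolute if urlparse(url).netloc == site]


def crawl_site(home_url, fetch):
    """Crawl every reachable page; `fetch(url) -> str` is supplied by the caller."""
    code_store = {}                   # code storage area: URL -> source code
    link_set = deque([home_url])      # links that have not been processed yet
    seen = {home_url}                 # every link ever placed in the link set
    while link_set:                   # S309: is there an unprocessed link?
        target = link_set.popleft()   # S305: pick the current target link
        source = fetch(target)        # S307: capture the page's source code
        code_store[target] = source
        for link in extract_links(target, source):  # S304 / S308
            if link not in seen:
                seen.add(link)
                link_set.append(link)
    return code_store
```

Passing `fetch` in as a parameter keeps the loop independent of any particular HTTP client; in a distributed setup it would be the place where the selected target crawler is invoked.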
Corresponding to the website detection method of the present application, the present application further provides a website detection device.
For example, referring to Fig. 4, which shows a schematic structural diagram of a website detection device of the present application, the device of this embodiment may include:
an address acquisition unit 401, configured to obtain the website address of a website to be detected;
a code crawling unit 402, configured to crawl in sequence, according to the website address, the source code of each webpage contained in the website;
an abnormality detection unit 403, configured to perform abnormality detection on the source code of each webpage in the website according to multiple preset search engine optimization (SEO) rules to obtain an abnormality detection result of the website, the abnormality detection result including the abnormal webpages in the website that do not conform to the SEO rules and the reason each abnormal webpage does not conform to the SEO rules;
a result output unit 404, configured to output the abnormality detection result.
In one possible implementation, the multiple SEO rules among the abnormality detection rules include: at least one first SEO rule applicable within a webpage and at least one second SEO rule applicable between different webpages;
the abnormality detection unit includes:
a first abnormality detection unit, configured to perform, according to the at least one first SEO rule applicable within a webpage, abnormality detection on the source code of each webpage in the website respectively, to obtain the abnormality detection result of each webpage in the website;
a second abnormality detection unit, configured to perform, according to the at least one second SEO rule applicable between different webpages, abnormality detection between different webpages in the website, to obtain at least one abnormal webpage group in the website in which an abnormality exists between webpages, together with the cause of the abnormality of that abnormal webpage group, the abnormal webpage group including at least two abnormal webpages.
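As an illustration of how per-page (first) SEO rules might be applied to each webpage's source code, the sketch below treats every rule as a function that returns a reason string when the page violates it. The two rules shown (non-empty title, presence of a meta description) and the names `SEO_RULES` and `detect_site` are assumptions chosen for the example; the application does not enumerate them at this point.

```python
import re


def rule_has_title(source):
    """Hypothetical in-page rule: the page must have a non-empty <title>."""
    match = re.search(r"<title[^>]*>(.*?)</title>", source, re.I | re.S)
    if match and match.group(1).strip():
        return None
    return "missing or empty <title>"


def rule_has_meta_description(source):
    """Hypothetical in-page rule: the page must declare a meta description."""
    if re.search(r"<meta[^>]+name=[\"']description[\"']", source, re.I):
        return None
    return "missing meta description"


# Each rule returns None when the page passes, or a reason string when it fails.
SEO_RULES = [rule_has_title, rule_has_meta_description]


def detect_site(code_store, rules=SEO_RULES):
    """Return {url: [reasons]} for every page that violates at least one rule."""
    result = {}
    for url, source in code_store.items():
        reasons = [r for r in (rule(source) for rule in rules) if r]
        if reasons:
            result[url] = reasons
    return result
```

The returned mapping pairs each abnormal webpage with the reasons it fails, which matches the shape of the abnormality detection result described above.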
Optionally, the at least one second SEO rule includes: a duplicate page detection rule;
the second abnormality detection unit includes:
a body text extraction unit, configured to extract, in response to the duplicate page detection rule, the body text data of each webpage of the website respectively;
a fingerprint calculation unit, configured to calculate, for each webpage in the website, the locality-sensitive fingerprint of the webpage based on the body text data of the webpage;
a duplicate detection unit, configured to, for each webpage in the website, calculate, according to the locality-sensitive fingerprints of the webpages in the website, the Hamming distance between that webpage and each of the other webpages in the website, determine at least one webpage whose Hamming distance from that webpage is less than a set threshold, and determine that webpage and the at least one webpage as one abnormal webpage group in which a content duplication abnormality exists.
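The fingerprint and Hamming-distance comparison can be sketched as below. The text only says "locality-sensitive fingerprint"; SimHash is used here as one common technique of that kind and is an assumption, as are the word-level tokenization, the 64-bit width, the use of SHA-256 as the token hash, and the default threshold of 3.

```python
import hashlib
import re


def simhash(text, bits=64):
    """Compute a SimHash fingerprint from the word tokens of a page's body text."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (value >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def duplicate_groups(body_texts, threshold=3):
    """Group each page with the pages whose distance to it is below the threshold."""
    prints = {url: simhash(text) for url, text in body_texts.items()}
    groups = []
    for url, fp in prints.items():
        near = [other for other, other_fp in prints.items()
                if other != url and hamming_distance(fp, other_fp) < threshold]
        group = frozenset([url, *near])
        if near and group not in groups:
            groups.append(group)
    return groups
```

Comparing every pair of pages is quadratic in the number of pages; for a large website an index over fingerprint segments would likely be needed, which this sketch leaves out.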
In another possible implementation, the code crawling unit includes:
a first crawling unit, configured to crawl the source code of the homepage of the website according to the website address;
a first link extraction unit, configured to extract at least one link contained in the source code of the homepage and cache the extracted links in a link set;
a second crawling unit, configured to, for each link in the link set, capture the source code of the corresponding webpage in the website according to the link;
a second link extraction unit, configured to extract the links contained in the source code of the webpage and cache the extracted links in the link set;
a crawl termination unit, configured to, if the link set contains a link that has not yet been processed, return to the operation of capturing, for each link, the source code of the webpage in the website according to the link, until the source code of each webpage contained in the website is obtained.
Optionally, the second crawling unit includes:
a link determination subunit, configured to determine the currently pending target link from the link set;
a crawler allocation subunit, configured to determine, from a distributed crawler, the target crawler suited to process the target link;
a second crawling subunit, configured to capture, through the target crawler, the source code of the webpage pointed to by the target link.
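One simple way to pick a target crawler under a load-balancing principle is a least-loaded policy, sketched below. The `CrawlerModule` type, the function names, and the use of pending-link count as the load measure are all assumptions made for illustration; the application does not say which load metric is used.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlerModule:
    """One crawler module inside the distributed crawler (hypothetical type)."""
    name: str
    pending: list = field(default_factory=list)  # links assigned but not yet crawled


def pick_target_crawler(modules):
    """Return the module with the fewest pending links (least-loaded policy)."""
    return min(modules, key=lambda module: len(module.pending))


def assign_link(modules, target_link):
    """Assign one target link to the least-loaded module and return that module."""
    crawler = pick_target_crawler(modules)
    crawler.pending.append(target_link)
    return crawler
```

Because `min` returns the first module among equally loaded ones, links are spread in round-robin fashion while all modules are idle.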
Optionally, in the above device embodiments of the present application, the device may further include:
an optimization determination unit, configured to determine, before the result output unit outputs the abnormality detection result, an optimization scheme for the abnormal webpages in the website according to the search engine optimization (SEO) rules and the reason each abnormal webpage does not conform to the SEO rules;
an optimization output unit, configured to output the optimization scheme for the abnormal webpages in the website while the result output unit outputs the abnormality detection result.
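A minimal sketch of how an optimization scheme might be derived from the detected reasons and output together with the result is given below. The `ADVICE` table, its entries and the `build_report` function are hypothetical; the application does not specify how the optimization scheme is worded or stored.

```python
# Hypothetical mapping from a detected reason to optimization advice.
ADVICE = {
    "missing or empty <title>": "Add a unique, descriptive <title> to the page.",
    "duplicate content": "Merge the duplicated pages or rewrite one of them.",
}


def build_report(detection_result, advice=ADVICE):
    """Attach an optimization scheme to every reason reported for every abnormal page."""
    report = []
    for url, reasons in detection_result.items():
        for reason in reasons:
            report.append({
                "page": url,
                "reason": reason,
                "optimization": advice.get(reason, "No preset advice; review manually."),
            })
    return report
```

Each report entry carries the abnormal webpage, the reason, and the suggested fix, so the detection result and the optimization scheme can be output at the same time.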
Optionally, the address acquisition unit in the device includes:
a domain name acquisition unit, configured to obtain the domain name of the website to be detected as input by a user;
an address conversion unit, configured to determine the uniform resource locator (URL) of the website based on the domain name of the website.
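The conversion from a user-supplied domain name to a URL could look roughly like this. The function name `domain_to_url`, the default `http` scheme, and the choice to return the site root are assumptions; the application only states that the URL is determined from the domain name.

```python
from urllib.parse import urlsplit


def domain_to_url(user_input, default_scheme="http"):
    """Turn a user-supplied domain name into the URL of the site's home page."""
    text = user_input.strip()
    if "://" not in text:
        # No scheme given: assume the default one so the host can be parsed.
        text = f"{default_scheme}://{text}"
    parts = urlsplit(text)
    if not parts.netloc:
        raise ValueError(f"not a valid domain name: {user_input!r}")
    return f"{parts.scheme}://{parts.netloc.lower()}/"
```

For example, `domain_to_url("Example.com")` yields `http://example.com/`, which can then serve as the starting address for crawling the homepage.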
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another. Because the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for related details, refer to the corresponding description of the method embodiments.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The above is only the preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A website detection method, characterized by comprising:
obtaining the website address of a website to be detected;
crawling in sequence, according to the website address, the source code of each webpage contained in the website;
performing abnormality detection on the source code of each webpage in the website according to multiple preset search engine optimization (SEO) rules to obtain an abnormality detection result of the website, the abnormality detection result including the abnormal webpages in the website that do not conform to the SEO rules and the reason each abnormal webpage does not conform to the SEO rules;
outputting the abnormality detection result.
2. The website detection method according to claim 1, characterized in that the multiple SEO rules include: at least one first SEO rule applicable within a webpage and at least one second SEO rule applicable between different webpages;
the performing abnormality detection on the source code of each webpage in the website according to multiple preset search engine optimization (SEO) rules comprises:
performing, according to the at least one first SEO rule applicable within a webpage, abnormality detection on the source code of each webpage in the website respectively, to obtain the abnormality detection result of each webpage in the website;
performing, according to the at least one second SEO rule applicable between different webpages, abnormality detection between different webpages in the website, to obtain at least one abnormal webpage group in the website in which an abnormality exists between webpages, together with the cause of the abnormality of that abnormal webpage group, the abnormal webpage group including at least two abnormal webpages.
3. The website detection method according to claim 2, characterized in that the at least one second SEO rule includes: a duplicate page detection rule;
the performing, according to the at least one second SEO rule applicable between different webpages, abnormality detection between different webpages in the website comprises:
extracting, in response to the duplicate page detection rule, the body text data of each webpage of the website respectively;
calculating, for each webpage in the website, the locality-sensitive fingerprint of the webpage based on the body text data of the webpage;
for each webpage in the website, calculating, according to the locality-sensitive fingerprints of the webpages in the website, the Hamming distance between that webpage and each of the other webpages in the website, determining at least one webpage whose Hamming distance from that webpage is less than a set threshold, and determining that webpage and the at least one webpage as one abnormal webpage group in which content duplication exists.
4. The website detection method according to claim 1, characterized in that the crawling in sequence, according to the website address, the source code of each webpage contained in the website comprises:
crawling the source code of the homepage of the website according to the website address;
extracting at least one link contained in the source code of the homepage, and caching the extracted links in a link set;
for each link in the link set that has not been processed, capturing the source code of the corresponding webpage in the website according to the link;
extracting the links contained in the source code of the webpage, and caching the extracted links in the link set;
if the link set contains a link that has not yet been processed, returning to the operation of capturing, for each link, the source code of the webpage in the website according to the link, until the link set contains no link that has not yet been processed, thereby obtaining the source code of each webpage contained in the website.
5. The website detection method according to claim 4, characterized in that the capturing, for each link in the link set, the source code of the webpage in the website according to the link comprises:
determining the currently pending target link from the link set;
determining, from a distributed crawler, the target crawler suited to process the target link;
capturing, through the target crawler, the source code of the webpage pointed to by the target link.
6. The website detection method according to claim 1, characterized in that, before the outputting the abnormality detection result, the method further comprises:
determining an optimization scheme for the abnormal webpages in the website according to the search engine optimization (SEO) rules and the reason each abnormal webpage does not conform to the SEO rules;
while outputting the abnormality detection result, the method further comprises: outputting the optimization scheme for the abnormal webpages in the website.
7. The website detection method according to claim 1, characterized in that the obtaining the website address of a website to be detected comprises:
obtaining the domain name of the website to be detected as input by a user;
determining the uniform resource locator (URL) of the website based on the domain name of the website.
8. A website detection device, characterized by comprising:
an address acquisition unit, configured to obtain the website address of a website to be detected;
a code crawling unit, configured to crawl in sequence, according to the website address, the source code of each webpage contained in the website;
an abnormality detection unit, configured to perform abnormality detection on the source code of each webpage in the website according to multiple preset search engine optimization (SEO) rules to obtain an abnormality detection result of the website, the abnormality detection result including the abnormal webpages in the website that do not conform to the SEO rules and the reason each abnormal webpage does not conform to the SEO rules;
a result output unit, configured to output the abnormality detection result.
9. The website detection device according to claim 8, characterized in that the multiple SEO rules among the abnormality detection rules include: at least one first SEO rule applicable within a webpage and at least one second SEO rule applicable between different webpages;
the abnormality detection unit includes:
a first abnormality detection unit, configured to perform, according to the at least one first SEO rule applicable within a webpage, abnormality detection on the source code of each webpage in the website respectively, to obtain the abnormality detection result of each webpage in the website;
a second abnormality detection unit, configured to perform, according to the at least one second SEO rule applicable between different webpages, abnormality detection between different webpages in the website, to obtain at least one abnormal webpage group in the website in which an abnormality exists between webpages, together with the cause of the abnormality of that abnormal webpage group, the abnormal webpage group including at least two abnormal webpages.
10. The website detection device according to claim 9, characterized in that the at least one second SEO rule includes: a duplicate page detection rule;
the second abnormality detection unit includes:
a body text extraction unit, configured to extract, in response to the duplicate page detection rule, the body text data of each webpage of the website respectively;
a fingerprint calculation unit, configured to calculate, for each webpage in the website, the locality-sensitive fingerprint of the webpage based on the body text data of the webpage;
a duplicate detection unit, configured to, for each webpage in the website, calculate, according to the locality-sensitive fingerprints of the webpages in the website, the Hamming distance between that webpage and each of the other webpages in the website, determine at least one webpage whose Hamming distance from that webpage is less than a set threshold, and determine that webpage and the at least one webpage as one abnormal webpage group in which content duplication exists.
CN201910531749.1A 2019-06-19 2019-06-19 Website detection method and device Pending CN110263283A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910531749.1A CN110263283A (en) 2019-06-19 2019-06-19 Website detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910531749.1A CN110263283A (en) 2019-06-19 2019-06-19 Website detection method and device

Publications (1)

Publication Number Publication Date
CN110263283A true CN110263283A (en) 2019-09-20

Family

ID=67919405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910531749.1A Pending CN110263283A (en) 2019-06-19 2019-06-19 Website detection method and device

Country Status (1)

Country Link
CN (1) CN110263283A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100114864A1 (en) * 2008-11-06 2010-05-06 Leedor Agam Method and system for search engine optimization
WO2011040981A1 (en) * 2009-10-02 2011-04-07 David Drai System and method for search engine optimization
CN104572787A (en) * 2013-10-29 2015-04-29 腾讯科技(深圳)有限公司 Method and device for recognizing pseudo original website
CN106339372A (en) * 2015-07-06 2017-01-18 阿里巴巴集团控股有限公司 Search engine optimization method and device
CN107807937A (en) * 2016-09-09 2018-03-16 阿里巴巴集团控股有限公司 A kind of website SEO processing methods, apparatus and system
CN109104421A (en) * 2018-08-01 2018-12-28 深信服科技股份有限公司 A kind of web site contents altering detecting method, device, equipment and readable storage medium storing program for executing

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
SEO网站优化: ""SEO新人须知:如何解决避免网站内容重复问题及如何检查"", 《HTTPS://WWW.OH100.COM/PEIXUN/SEO/24938.HTML》 *
信安消防检测: ""网站SEO诊断方法"", 《HTTPS://JINGYAN.BAIDU.COM/ARTICLE/84B4F56595878860F6DA32AF.HTML》 *
王君泽: "《网络舆情应对的关键技术研究》", 31 January 2017, 华中科技大学出版社 *
边馥苓等: "《时空大数据的技术与方法》", 31 May 2016, 北京:测绘出版社 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929257A (en) * 2019-10-30 2020-03-27 武汉绿色网络信息服务有限责任公司 Method and device for detecting malicious codes carried in webpage
CN110929257B (en) * 2019-10-30 2022-02-01 武汉绿色网络信息服务有限责任公司 Method and device for detecting malicious codes carried in webpage
CN112532469A (en) * 2020-10-27 2021-03-19 深圳市牛商网络股份有限公司 Website detection method, system, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190920)