CN107229631B

CN107229631B - Method and device for capturing website data

Info

Publication number: CN107229631B
Application number: CN201610171622.XA
Authority: CN
Inventors: 朱德伟
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2016-03-24
Filing date: 2016-03-24
Publication date: 2020-11-03
Anticipated expiration: 2036-03-24
Also published as: CN107229631A

Abstract

The invention provides a method and a device for capturing website data, which crawl websites according to the code quality of the websites. Websites with poor code quality are filtered out, which reduces the workload of the web crawler, avoids wasting time on websites with low code quality when a client searches, and improves the user experience to a certain extent. The method for capturing website data comprises the following steps: acquiring a web page of a website and determining the code quality of the web page; determining the crawling probability of the website according to the code quality of the web page; and capturing the data of the website according to the crawling probability of the website.

Description

Method and device for capturing website data

Technical Field

The invention relates to the technical field of computers and software thereof, in particular to a method and a device for capturing website data.

Background

A web crawler (also called a web spider or web robot) is a program or script that automatically crawls the World Wide Web according to certain rules. Web page crawling strategies can be divided into three types: depth-first, breadth-first and best-first. There are also dedicated algorithms for determining web page weight, such as PageRank (also called web page ranking, web page level, or Google left-side ranking), a link analysis algorithm proposed by Google founders Larry Page and Sergey Brin in 1997 when building an early prototype of their search system. Since Google's unprecedented commercial success, the algorithm has also become a computational model of great interest to other search engines and to academia.

At present, many important link analysis algorithms are derived from the PageRank algorithm. PageRank is the method Google uses to identify the rank/importance of web pages and serves as Google's criterion for measuring the quality of a website. After all other factors, such as title and keyword signals, have been combined, Google adjusts the results through PageRank, so that web pages with a higher level/importance rank higher in the search results, which improves the relevance and quality of the search results. PageRank levels (PR values) range from 0 to 10, with 10 being the maximum. A higher PR value indicates a more popular (more important) web page, and the higher the PR value, the higher the probability that the page will be crawled. For example, a PR value of 1 indicates that a website is not very popular, while a PR value of 7 to 10 indicates that it is very popular (or extremely important). A PR value of 4 generally indicates a good website. Google sets the PR value of its own website to 10, which indicates that Google's website is very popular and important.

Conventional web crawlers use the PageRank algorithm when crawling web pages: the importance of a web page is calculated according to the algorithm, and the website's data is crawled as long as the PR value of the web page meets the requirement. This increases the workload of the web crawler to a certain extent, and the huge volume of website data wastes the client's time, which in turn degrades the client's experience.

Disclosure of Invention

In view of this, the present invention provides a method and an apparatus for capturing website data that crawl websites according to the code quality of the websites. Websites with poor code quality are filtered out, which reduces the workload of the web crawler, avoids wasting time on websites with low code quality when a client searches, and improves the user experience to a certain extent.

To achieve the above object, according to one aspect of the present invention, a method for crawling website data is provided.

The method for capturing the website data comprises the following steps: acquiring a webpage of a website, and determining the code quality of the webpage; determining the capturing probability of the website according to the code quality of the webpage; and capturing the data of the website according to the capturing probability of the website.

Optionally, the step of determining the code quality of the web page includes: first, determining a score for each of one or more of the following modes: determining a redundant code score of the web page using a redundant code inspection tool; counting repeated keywords to obtain a repetition score of the web page; checking the version of the libraries referenced by the web page to determine a reference library version score of the web page; determining a JavaScript code quality score of the web page using a code inspection tool; determining a CSS quality score of the web page using a static CSS code inspection tool; and counting the number of not-recommended tags among the HTML tags to obtain a tag score of the web page; and then taking the sum of the scores as the code quality of the web page.

Optionally, the web pages of the website include the home page of the website and a set number of second-level pages of the website; the step of determining the crawling probability of the website according to the code quality of the web pages includes: calculating the average web page quality score according to the following formula: average web page quality score = (code quality score of the home page of the website + sum of the code quality scores of the second-level pages of the website) / (1 + number of second-level pages of the website); and calculating the crawling probability of the website according to the following formula: crawling probability of the website = (maximum value of the set score range - average web page quality score) / maximum value of the set score range.
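The two formulas above can be written compactly as follows. The symbols are notation introduced here for illustration only and do not appear in the original text: $q_0$ is the code quality score of the home page, $q_1, \dots, q_k$ are the code quality scores of the $k$ second-level pages, $\bar{q}$ is the average web page quality score, $S_{\max}$ is the maximum value of the set score range, and $P$ is the crawling probability. The numerator of the first formula is read here as the sum over all second-level pages.

```latex
\[
\bar{q} = \frac{q_0 + \sum_{i=1}^{k} q_i}{1 + k},
\qquad
P = \frac{S_{\max} - \bar{q}}{S_{\max}}
\]
```

Because the individual scores count problems, a lower $\bar{q}$ yields a higher $P$.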

Optionally, the step of capturing the data of the website according to the capturing probability of the website includes: firstly, determining that the capturing probability of the website is not less than the lower limit value of the preset capturing probability, and then capturing data of the website.

According to another aspect of the invention, a device for crawling website data is provided.

The device for capturing website data comprises: an obtaining module, used for acquiring a web page of a website and then determining the code quality of the web page; a determining module, used for determining the crawling probability of the website according to the code quality of the web page; and a capturing module, used for capturing the data of the website according to the crawling probability of the website.

Optionally, the obtaining module is further configured to: first, determine a score for each of one or more of the following modes: determining a redundant code score of the web page using a redundant code inspection tool; counting repeated keywords to obtain a repetition score of the web page; checking the version of the libraries referenced by the web page to determine a reference library version score of the web page; determining a JavaScript code quality score of the web page using a code inspection tool; determining a CSS quality score of the web page using a static CSS code inspection tool; and counting the number of not-recommended tags among the HTML tags to obtain a tag score of the web page; and then take the sum of the scores as the code quality of the web page.

Optionally, the web pages of the website include the home page of the website and a set number of second-level pages of the website; the determining module is further configured to: calculate the average web page quality score according to the following formula: average web page quality score = (code quality score of the home page of the website + sum of the code quality scores of the second-level pages of the website) / (1 + number of second-level pages of the website); and calculate the crawling probability of the website according to the following formula: crawling probability of the website = (maximum value of the set score range - average web page quality score) / maximum value of the set score range.

Optionally, the capturing module is further configured to first determine that the capturing probability of the website is not less than a preset lower limit of the capturing probability, and then capture data of the website.

According to another aspect of the invention, an apparatus for crawling website data is provided.

The invention relates to a device for capturing website data, which comprises: a memory and a processor, wherein the memory stores instructions; the processor executing the instructions to: acquiring a webpage of a website, and determining the code quality of the webpage; determining the capturing probability of the website according to the code quality of the webpage; and capturing the data of the website according to the capturing probability of the website.

Optionally, the processor is further configured to: first, determine a score for each of one or more of the following modes: determining a redundant code score of the web page using a redundant code inspection tool; counting repeated keywords to obtain a repetition score of the web page; checking the version of the libraries referenced by the web page to determine a reference library version score of the web page; determining a JavaScript code quality score of the web page using a code inspection tool; determining a CSS quality score of the web page using a static CSS code inspection tool; and counting the number of not-recommended tags among the HTML tags to obtain a tag score of the web page; and then take the sum of the scores as the code quality of the web page.

Optionally, the web pages of the website include the home page of the website and a set number of second-level pages of the website; the processor is further configured to: calculate the average web page quality score according to the following formula: average web page quality score = (code quality score of the home page of the website + sum of the code quality scores of the second-level pages of the website) / (1 + number of second-level pages of the website); and calculate the crawling probability of the website according to the following formula: crawling probability of the website = (maximum value of the set score range - average web page quality score) / maximum value of the set score range.

According to still another aspect of embodiments of the present invention, there is provided an electronic apparatus including: one or more processors; the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors realize the method for capturing the website data provided by the invention.

According to still another aspect of the embodiments of the present invention, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor implements the method for crawling website data provided by the present invention.

According to the technical scheme of the invention, as the capturing probability of the website is obtained by analyzing the code quality of the website, websites with poor code quality can be filtered, so that the workload of a web crawler is reduced, the waste of time on websites with low code quality is avoided when a client searches, and the use experience of the user is improved to a certain extent.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a diagram illustrating an apparatus for crawling website data according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating a method for crawling website data according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of another apparatus for crawling website data according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Fig. 1 is a schematic diagram of an apparatus for crawling website data according to an embodiment of the present invention. As shown in Fig. 1, the apparatus 10 for capturing website data according to an embodiment of the present invention mainly includes an obtaining module 11, a determining module 12, and a capturing module 13. The obtaining module 11 is configured to acquire a web page of a website and then determine the code quality of the web page; the determining module 12 is configured to determine the crawling probability of the website according to the code quality of the web page; and the capturing module 13 is configured to capture the data of the website according to the crawling probability of the website. The web pages of the website comprise the home page of the website and a set number of second-level pages of the website.

The obtaining module 11 of the apparatus 10 for capturing website data according to the embodiment of the present invention may be further configured to: first, determine a score for each of one or more of the following modes: determining a redundant code score of the web page using a redundant code inspection tool; counting repeated keywords to obtain a repetition score of the web page; checking the version of the libraries referenced by the web page to determine a reference library version score of the web page; determining a JavaScript code quality score of the web page using a code inspection tool; determining a CSS quality score of the web page using a static CSS code inspection tool; and counting the number of not-recommended tags among the HTML tags to obtain a tag score of the web page; and then take the sum of the scores as the code quality of the web page.

The determining module 12 of the apparatus 10 for capturing website data according to the embodiment of the present invention may further be configured to: calculate the average web page quality score according to the following formula: average web page quality score = (code quality score of the home page of the website + sum of the code quality scores of the second-level pages of the website) / (1 + number of second-level pages of the website); and calculate the crawling probability of the website according to the following formula: crawling probability of the website = (maximum value of the set score range - average web page quality score) / maximum value of the set score range.

The capturing module 13 of the apparatus 10 for capturing website data according to the embodiment of the present invention may be further configured to first determine that the capturing probability of the website is not less than a preset lower limit of the capturing probability, and then capture the data of the website.

Fig. 2 is a schematic diagram of a method for crawling website data according to an embodiment of the present invention. As shown in fig. 2, the main implementation of the method is the apparatus 10 for crawling website data mentioned in fig. 1, and the method mainly includes steps S20 to S22.

Step S20: acquiring a web page of the website and determining the code quality of the web page. In this step, a web page of the website is first acquired, and then a score is determined for each of one or more of the following modes:

Determining a redundant code score of the web page using a redundant code inspection tool. Redundant code here refers to code segments in the web page's code that are unnecessary. It can be detected with tools or plug-ins such as the duplicate-code checker Simian, Codestyle or FindBugs, so as to obtain the redundant code score of the web page. For example, 1 point may be added to the redundant code score of the web page for every n lines of redundant code, where n is greater than or equal to 1.
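The "1 point per n redundant lines" rule can be sketched as follows. This is only an illustration: it does not call Simian or any other tool named above, and it assumes, purely for the sketch, that a line is "redundant" when it exactly repeats an earlier non-blank line. The function name and the sample input are hypothetical.

```python
from collections import Counter


def redundant_code_score(source: str, n: int = 1) -> int:
    """Return 1 point for every n redundant lines (n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Assumption for this sketch: a redundant line is an exact repeat
    # of an earlier non-blank line (a real checker is far more thorough).
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    counts = Counter(lines)
    redundant_lines = sum(count - 1 for count in counts.values())
    return redundant_lines // n


sample = "a = 1\nb = 2\na = 1\na = 1\n"
print(redundant_code_score(sample, n=1))
```

In practice the count of redundant lines would come from the inspection tool's report; only the final integer division reflects the rule described in the text.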

Counting repeated keywords to obtain a repetition score of the web page. A Meta tag is a description of a web page, and poor-quality web pages generally pile repeated keywords into the page-description Meta content for search engines to index. The repetition score of the web page can therefore be determined by counting how many times keywords are repeated in the page-description Meta content. For example, 1 point may be added to the repetition score of the web page each time a keyword is repeated n times, where n is greater than or equal to 3.
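A minimal sketch of this keyword count, using only the Python standard library, is shown below. It makes several assumptions that the text does not specify: only `meta` tags named `keywords` or `description` are read, keywords are split on commas and whitespace, and "repeated n times" is interpreted as n occurrences of the same keyword. All names are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class MetaContentParser(HTMLParser):
    """Collects the content of keywords/description meta tags."""

    def __init__(self):
        super().__init__()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() in {"keywords", "description"}:
            self.contents.append(attr.get("content") or "")


def repetition_score(html: str, n: int = 3) -> int:
    """Add 1 point for every n occurrences of the same keyword (n >= 3)."""
    if n < 3:
        raise ValueError("n must be at least 3")
    parser = MetaContentParser()
    parser.feed(html)
    words = []
    for content in parser.contents:
        # Assumption: keywords are separated by commas and/or whitespace.
        words.extend(w.lower() for w in re.split(r"[,\s]+", content) if w)
    return sum(count // n for count in Counter(words).values())


page = '<head><meta name="keywords" content="cheap, cheap, cheap, shoes"></head>'
print(repetition_score(page))
```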

Checking the version of the libraries referenced by the web page to determine a reference library version score of the web page. For example, the version number of a library referenced by the web page is checked to determine whether it is lower than a set reference version; if it is lower, 1 point is added to the reference library version score of the web page, and otherwise the score is unchanged. In addition, the referenced version is compared with a pre-stored collection of stable versions; if it does not belong to that collection, 1 point is added to the reference library version score of the web page, and otherwise the score is unchanged.
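The two checks described above (older than a set version, and not in the stored collection of stable versions) might be combined as in the sketch below. It assumes purely numeric dotted version strings such as "1.9.0"; the function names and the version numbers in the example are invented for illustration and do not refer to any real library.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted numeric version such as '1.9.0' into (1, 9, 0)."""
    return tuple(int(part) for part in text.split("."))


def library_version_score(version: str, minimum_version: str,
                          stable_versions: set) -> int:
    """Score one referenced library; a higher score means a worse result."""
    score = 0
    # Check 1: the referenced version is lower than the set version.
    if parse_version(version) < parse_version(minimum_version):
        score += 1
    # Check 2: the referenced version is not a known stable version.
    if version not in stable_versions:
        score += 1
    return score


stable = {"1.12.4", "3.6.0"}  # hypothetical pre-stored stable versions
print(library_version_score("1.4.2", "1.9.0", stable))
```

Comparing tuples of integers avoids the pitfall of string comparison, where "1.10.0" would sort before "1.9.0".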

Determining a JavaScript code quality score of the web page using a code inspection tool. Code quality here refers to the problems present in the code. The number of problems in the website's JavaScript code can be checked with a code inspection tool such as JSCS, so as to determine the code quality of the web page. For example, if the number of problems found in the web page code by the code inspection tool is greater than a set upper limit, 1 point is added to the code quality score of the web page; otherwise, the code quality score of the web page is unchanged.

Determining a CSS quality score of the web page using a static CSS code inspection tool. The CSS quality of a web page can be determined by examining the tags used in the web page code. For example, the number of times the web page code uses tags that are not recommended with CSS is checked, and the CSS quality score of the web page is increased by 1 point each time <tr></tr> is used. At the same time, the number of times CSS is written directly inside an individual tag is checked, and 1 point is added to the CSS quality score of the web page for every n such occurrences, where n is greater than or equal to 1; otherwise, the CSS quality score of the web page is unchanged.
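The CSS check above can be approximated with the standard-library HTML parser, as in this sketch. It is not a real static CSS checker. It assumes that "CSS written inside an individual tag" means an inline `style` attribute, and it uses a discouraged-tag set containing only `tr`, taken from the example in the text; both are assumptions, and the names are hypothetical.

```python
from html.parser import HTMLParser


class CssUsageParser(HTMLParser):
    """Counts discouraged tags and inline style attributes."""

    def __init__(self, discouraged_tags):
        super().__init__()
        self.discouraged_tags = discouraged_tags
        self.discouraged_uses = 0
        self.inline_styles = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.discouraged_tags:
            self.discouraged_uses += 1
        if any(name == "style" for name, _ in attrs):
            self.inline_styles += 1


def css_quality_score(html: str, n: int = 1,
                      discouraged_tags=frozenset({"tr"})) -> int:
    """1 point per discouraged tag use, plus 1 point per n inline styles."""
    if n < 1:
        raise ValueError("n must be at least 1")
    parser = CssUsageParser(discouraged_tags)
    parser.feed(html)
    return parser.discouraged_uses + parser.inline_styles // n


page = '<table><tr><td style="color:red">a</td></tr></table>'
print(css_quality_score(page))
```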

Counting the number of not-recommended tags among the HTML tags to obtain a tag score of the web page. HTML includes a number of tags whose use is not recommended, so the tag score of the web page can be obtained by comparing the HTML tags of the web page with a pre-stored library of not-recommended tags: if a not-recommended tag is used, 1 point is added to the tag score of the web page; otherwise, the tag score of the web page is unchanged.
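The comparison against a pre-stored tag library might look like the sketch below. The contents of `NOT_RECOMMENDED` are an illustrative assumption (the text does not list the library's contents), and the sketch adds 1 point per use of such a tag, which is one possible reading of the rule.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the pre-stored library of not-recommended tags.
NOT_RECOMMENDED = frozenset({"font", "center", "marquee"})


class TagCollector(HTMLParser):
    """Records the name of every opening tag (lower-cased by the parser)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_score(html: str, not_recommended=NOT_RECOMMENDED) -> int:
    """Add 1 point for each use of a not-recommended tag."""
    collector = TagCollector()
    collector.feed(html)
    return sum(1 for tag in collector.tags if tag in not_recommended)


page = '<center><font color="red">Hi</font></center><p>ok</p>'
print(tag_score(page))
```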

The sum of these scores is then taken as the code quality of the web page.
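As a small illustration of this summation, the sketch below holds the six per-mode scores in a record and adds them up. The class and field names are hypothetical; modes that are not used simply keep their default of 0, which matches the "one or more of the following modes" wording.

```python
from dataclasses import dataclass, fields


@dataclass
class PageScores:
    """Per-mode scores for one web page; unused modes stay at 0."""
    redundant_code: int = 0
    repetition: int = 0
    library_version: int = 0
    javascript_quality: int = 0
    css_quality: int = 0
    tag: int = 0

    def code_quality(self) -> int:
        # The code quality of the page is the plain sum of all scores.
        return sum(getattr(self, f.name) for f in fields(self))


scores = PageScores(redundant_code=2, repetition=1, css_quality=3)
print(scores.code_quality())
```

Since every mode adds points for problems found, a larger total means poorer code quality.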

Step S21: determining the crawling probability of the website according to the code quality of the web pages. The web pages referred to in step S20 include the website's home page and a set number of the website's second-level pages. Through step S20, the apparatus 10 for capturing website data determines the code quality of the acquired home page and second-level pages. It then calculates the average web page quality score according to the following formula: average web page quality score = (code quality score of the home page of the website + sum of the code quality scores of the second-level pages of the website) / (1 + number of second-level pages of the website). Finally, it calculates the crawling probability of the website according to the following formula: crawling probability of the website = (maximum value of the set score range - average web page quality score) / maximum value of the set score range. For example, with the maximum value of the score range set to 100, if the average web page quality score of a website is 50, the crawling probability of the website is 0.5; if the average web page quality score is 20, the crawling probability is 0.8. That is, the lower a website's average web page quality score, the higher its crawling probability.
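The calculation in step S21 can be sketched in a few lines. The function and parameter names are hypothetical, and the numerator is read as the home page score plus the sum of all second-level page scores. The text does not say what happens when the average exceeds the maximum of the score range, so this sketch does not clamp the result.

```python
def average_quality_score(home_score: float, second_level_scores: list) -> float:
    """Average code quality over the home page and the second-level pages."""
    return (home_score + sum(second_level_scores)) / (1 + len(second_level_scores))


def crawl_probability(average_score: float, max_score: float = 100) -> float:
    """Lower average score (fewer problems) gives a higher probability."""
    # Not clamped: the source does not define the case average > max_score.
    return (max_score - average_score) / max_score


average = average_quality_score(60, [40, 50])  # three pages in total
print(average, crawl_probability(average))
```

With a score range maximum of 100, averages of 50 and 20 give probabilities of 0.5 and 0.8, matching the example in the text.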

Step S22: capturing the data of the website according to the crawling probability of the website. In this step, the apparatus 10 for capturing website data captures website data based on the crawling probability of the website obtained in step S21. For example, it may be set that a website is not crawled if its crawling probability is less than a set lower limit: with the lower limit of the crawling probability set to 0.4, a website whose crawling probability obtained in step S21 is 0.35 is not crawled.
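The threshold rule in step S22 amounts to a simple filter, sketched below. The function name and the host names are made up for illustration; the default lower limit of 0.4 is the value used in the example above.

```python
def select_sites_to_crawl(probabilities: dict, lower_limit: float = 0.4) -> list:
    """Keep sites whose crawling probability is not below the lower limit."""
    return [site for site, p in probabilities.items() if p >= lower_limit]


# Hypothetical sites and crawling probabilities from step S21.
candidates = {"a.example": 0.35, "b.example": 0.4, "c.example": 0.8}
print(select_sites_to_crawl(candidates))
```

The comparison is "not less than", so a site whose probability equals the lower limit is still crawled.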

According to the technical scheme of the embodiment of the invention, the capturing probability of the website is obtained by analyzing the code quality of the website, so that websites with poor code quality can be filtered, the workload of a web crawler is reduced, the waste of time on websites with low code quality is avoided when a client searches, and the use experience of the user is improved to a certain extent.

Fig. 3 is a schematic diagram of another apparatus for crawling website data according to an embodiment of the present invention. As shown in Fig. 3, the apparatus 30 for capturing website data of the present invention mainly includes a memory 31 and a processor 32, wherein the memory 31 stores instructions and the processor 32 executes the instructions to: acquire a web page of a website and determine the code quality of the web page; determine the crawling probability of the website according to the code quality of the web page; and capture the data of the website according to the crawling probability of the website. The web pages of the website comprise the home page of the website and a set number of second-level pages of the website.

The processor 32 of the apparatus 30 for crawling website data of the present invention is further configured to: first, determine a score for each of one or more of the following modes: determining a redundant code score of the web page using a redundant code inspection tool; counting repeated keywords to obtain a repetition score of the web page; checking the version of the libraries referenced by the web page to determine a reference library version score of the web page; determining a JavaScript code quality score of the web page using a code inspection tool; determining a CSS quality score of the web page using a static CSS code inspection tool; and counting the number of not-recommended tags among the HTML tags to obtain a tag score of the web page; and then take the sum of the scores as the code quality of the web page.

The processor 32 of the apparatus 30 for crawling website data of the present invention is further configured to: calculate the average web page quality score according to the following formula: average web page quality score = (code quality score of the home page of the website + sum of the code quality scores of the second-level pages of the website) / (1 + number of second-level pages of the website); and calculate the crawling probability of the website according to the following formula: crawling probability of the website = (maximum value of the set score range - average web page quality score) / maximum value of the set score range.

The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for crawling website data, comprising:

acquiring a webpage of a website, and determining the code quality of the webpage;

determining the capturing probability of the website according to the code quality of the webpage;

capturing data of the website according to the capturing probability of the website;

wherein the step of determining the code quality of the web page comprises:

firstly, determining the corresponding scores of the modes according to one or more of the following modes:

a redundant code check tool is used to determine a redundant code score for the web page,

counting the repeated keywords to obtain the repeated degree score of the webpage,

examining a reference library version of a web page to determine a reference library version score for the web page,

using a code inspection tool to determine a Javascript code quality score for the web page,

the CSS code static check tool is used to determine the CSS quality score for the web page,

comparing the html tag with a pre-stored tag library which is not recommended to use to obtain the tag score of the webpage;

and then taking the sum of the scores as the code quality of the webpage.

2. The method of claim 1,

the web pages of the website comprise a home page of the website and a set number of second-level pages of the website;

the step of determining the crawling probability of the website according to the code quality of the webpage comprises the following steps:

calculating the average web page quality score according to the following formula: average web page quality score = (code quality score of the home page of the website + sum of the code quality scores of the second-level pages of the website) / (1 + number of second-level pages of the website);

calculating the crawling probability of the website according to the following formula: crawling probability of the website = (maximum value of the set score range - average web page quality score) / maximum value of the set score range.

3. The method of claim 1, wherein the step of crawling the data of the website according to the crawling probability of the website comprises:

firstly, determining that the capturing probability of the website is not less than the lower limit value of the preset capturing probability, and then capturing data of the website.

4. An apparatus for crawling website data, comprising:

the obtaining module is used for acquiring a webpage of a website and then determining the code quality of the webpage;

the determining module is used for determining the capturing probability of the website according to the code quality of the webpage;

the capturing module is used for capturing the data of the website according to the capturing probability of the website; wherein the obtaining module is further configured to: firstly, determine a score for each of one or more of the following modes: determining a redundant code score of the webpage by using a redundant code inspection tool, counting repeated keywords to obtain a repetition score of the webpage, inspecting a reference library version of the webpage to determine a reference library version score of the webpage, determining a Javascript code quality score of the webpage by using a code inspection tool, determining a CSS quality score of the webpage by using a CSS code static inspection tool, and obtaining a tag score of the webpage by comparing the html tags with a pre-stored tag library which is not recommended to use; and then take the sum of the scores as the code quality of the webpage.

5. The apparatus of claim 4, wherein the web pages of the website comprise a first page of the website and a set number of second pages of the website; the determination module is further to:

6. The apparatus of claim 4, wherein the capturing module is further configured to first determine that the capturing probability of the website is not less than a preset lower limit value of the capturing probability, and then capture the data of the website.

7. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-3.

8. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-3.