CN107229631B - Method and device for capturing website data - Google Patents

Info

Publication number
CN107229631B
Authority
CN
China
Prior art keywords
website
score
webpage
code
quality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610171622.XA
Other languages
Chinese (zh)
Other versions
CN107229631A (en)
Inventor
朱德伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201610171622.XA
Publication of CN107229631A
Application granted
Publication of CN107229631B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a method and a device for capturing website data that crawl websites according to their code quality. Websites with poor code quality are filtered out, which reduces the workload of the web crawler, avoids wasting the client's time on websites with low code quality during searches, and improves the user experience to a certain extent. The method for capturing website data comprises: acquiring a web page of a website and determining the code quality of the web page; determining the capturing probability of the website according to the code quality of the web page; and capturing the data of the website according to the capturing probability of the website.

Description

Method and device for capturing website data
Technical Field
The invention relates to the technical field of computers and software thereof, in particular to a method and a device for capturing website data.
Background
A web crawler (also called a web spider or web robot) is a program or script that automatically crawls the World Wide Web according to certain rules. Web page crawling strategies fall into three types: depth-first, breadth-first and best-first. In addition, dedicated algorithms exist for determining web page weight, such as PageRank (also called web page ranking or web page level), a link analysis algorithm proposed by Google founders Larry Page and Sergey Brin in 1997 while building an early prototype of their search system. Since Google's unprecedented commercial success, the algorithm has also become a computational model of great interest to other search engines and to academia.
At present, many important link analysis algorithms are derived from the PageRank algorithm. PageRank is the method Google uses to identify the rank or importance of web pages and is an important criterion by which Google measures the quality of a website. After all other factors, such as title and keyword signals, have been combined, Google adjusts the results through PageRank so that web pages with a higher level or importance rank higher in the search results, which improves the relevance and quality of the results. PR values range from 0 to 10, with 10 being the maximum. A higher PR value indicates a more popular (more important) web page and a higher probability that the page will be crawled. For example, a PR value of 1 indicates that a website is not very popular, while a PR value of 7 to 10 indicates that a website is very popular (or extremely important). A PR value of 4 generally indicates a good website. Google sets the PR value of its own website to 10, indicating that its website is very popular and important.
Conventional web crawlers use the PageRank algorithm when crawling web pages: the importance of a web page is calculated with the algorithm, and the data of a website is crawled as long as the PR value of its web pages meets the requirement. This increases the crawler's workload to some extent, and the resulting volume of website data wastes the client's time and degrades the user experience.
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for capturing website data that crawl websites according to their code quality, so that websites with poor code quality are filtered out. This reduces the workload of the web crawler, avoids wasting the client's time on websites with low code quality during searches, and improves the user experience to a certain extent.
To achieve the above object, according to one aspect of the present invention, a method for crawling website data is provided.
The method for capturing the website data comprises the following steps: acquiring a webpage of a website, and determining the code quality of the webpage; determining the capturing probability of the website according to the code quality of the webpage; and capturing the data of the website according to the capturing probability of the website.
Optionally, the step of determining the code quality of the web page includes: first, determining a score for each of one or more of the following approaches: determining a redundant code score of the web page by using a redundant code inspection tool; counting repeated keywords to obtain a repetition score of the web page; inspecting the reference library version of the web page to determine a reference library version score of the web page; determining a JavaScript code quality score of the web page by using a code inspection tool; determining a CSS quality score of the web page by using a CSS static code inspection tool; and counting the number of non-recommended tags among the HTML tags to obtain a tag score of the web page; and then taking the sum of the scores as the code quality of the web page.
Optionally, the web pages of the website include the home page (first page) of the website and a set number of second-level pages of the website; the step of determining the crawling probability of the website according to the code quality of the web page comprises: calculating the average web page quality score according to the following formula: average web page quality score = (code quality score of the home page of the website + sum of the code quality scores of the second-level pages of the website) / (1 + number of second-level pages of the website); and calculating the crawling probability of the website according to the following formula: crawling probability of the website = (maximum value of the set score range - average web page quality score) / maximum value of the set score range.
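The two formulas can be written compactly as follows. The symbols are introduced here for illustration only and do not appear in the original text: q_0 is the code quality score of the home page, q_1, ..., q_k are the code quality scores of the k second-level pages, S_max is the maximum value of the set score range, q-bar is the average web page quality score, and p is the crawling probability.

```latex
\bar{q} = \frac{q_0 + \sum_{i=1}^{k} q_i}{1 + k},
\qquad
p = \frac{S_{\max} - \bar{q}}{S_{\max}}
```

Because a higher score means more detected problems, p decreases as the average score rises.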
Optionally, the step of capturing the data of the website according to the capturing probability of the website includes: first determining that the capturing probability of the website is not less than a preset lower limit of the capturing probability, and then capturing the data of the website.
According to another aspect of the invention, a device for crawling website data is provided.
The device for capturing website data comprises: the acquisition module is used for acquiring a webpage of a website and then determining the code quality of the webpage; the determining module is used for determining the capturing probability of the website according to the code quality of the webpage; and the grabbing module is used for grabbing the data of the website according to the grabbing probability of the website.
Optionally, the obtaining module is further configured to: first, determine a score for each of one or more of the following approaches: determining a redundant code score of the web page by using a redundant code inspection tool; counting repeated keywords to obtain a repetition score of the web page; inspecting the reference library version of the web page to determine a reference library version score of the web page; determining a JavaScript code quality score of the web page by using a code inspection tool; determining a CSS quality score of the web page by using a CSS static code inspection tool; and counting the number of non-recommended tags among the HTML tags to obtain a tag score of the web page; and then take the sum of the scores as the code quality of the web page.
Optionally, the web pages of the website include the home page of the website and a set number of second-level pages of the website; the determining module is further configured to: calculate the average web page quality score according to the following formula: average web page quality score = (code quality score of the home page of the website + sum of the code quality scores of the second-level pages of the website) / (1 + number of second-level pages of the website); and calculate the crawling probability of the website according to the following formula: crawling probability of the website = (maximum value of the set score range - average web page quality score) / maximum value of the set score range.
Optionally, the capturing module is further configured to first determine that the capturing probability of the website is not less than a preset lower limit of the capturing probability, and then capture data of the website.
According to another aspect of the invention, an apparatus for crawling website data is provided.
The invention relates to a device for capturing website data, which comprises: a memory and a processor, wherein the memory stores instructions; the processor executing the instructions to: acquiring a webpage of a website, and determining the code quality of the webpage; determining the capturing probability of the website according to the code quality of the webpage; and capturing the data of the website according to the capturing probability of the website.
Optionally, the processor is further configured to: first, determine a score for each of one or more of the following approaches: determining a redundant code score of the web page by using a redundant code inspection tool; counting repeated keywords to obtain a repetition score of the web page; inspecting the reference library version of the web page to determine a reference library version score of the web page; determining a JavaScript code quality score of the web page by using a code inspection tool; determining a CSS quality score of the web page by using a CSS static code inspection tool; and counting the number of non-recommended tags among the HTML tags to obtain a tag score of the web page; and then take the sum of the scores as the code quality of the web page.
Optionally, the web pages of the website include the home page of the website and a set number of second-level pages of the website; the processor is further configured to: calculate the average web page quality score according to the following formula: average web page quality score = (code quality score of the home page of the website + sum of the code quality scores of the second-level pages of the website) / (1 + number of second-level pages of the website); and calculate the crawling probability of the website according to the following formula: crawling probability of the website = (maximum value of the set score range - average web page quality score) / maximum value of the set score range.
According to still another aspect of embodiments of the present invention, there is provided an electronic apparatus including: one or more processors; the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors realize the method for capturing the website data provided by the invention.
According to still another aspect of the embodiments of the present invention, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor implements the method for crawling website data provided by the present invention.
According to the technical scheme of the invention, as the capturing probability of the website is obtained by analyzing the code quality of the website, websites with poor code quality can be filtered, so that the workload of a web crawler is reduced, the waste of time on websites with low code quality is avoided when a client searches, and the use experience of the user is improved to a certain extent.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a diagram illustrating an apparatus for crawling website data according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a method for crawling website data according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of another apparatus for crawling website data according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of an apparatus for crawling website data according to an embodiment of the present invention. As shown in fig. 1, an apparatus 10 for capturing website data according to an embodiment of the present invention mainly includes an obtaining module 11, a determining module 12, and a capturing module 13. The obtaining module 11 is configured to acquire a web page of a website and then determine the code quality of the web page; the determining module 12 is configured to determine the crawling probability of the website according to the code quality of the web page; the capturing module 13 is configured to capture the data of the website according to the crawling probability of the website. The web pages of the website comprise the home page of the website and a set number of second-level pages of the website.
The obtaining module 11 of the apparatus 10 for capturing website data according to the embodiment of the present invention may be further configured to: first, determine a score for each of one or more of the following approaches: determining a redundant code score of the web page by using a redundant code inspection tool; counting repeated keywords to obtain a repetition score of the web page; inspecting the reference library version of the web page to determine a reference library version score of the web page; determining a JavaScript code quality score of the web page by using a code inspection tool; determining a CSS quality score of the web page by using a CSS static code inspection tool; and counting the number of non-recommended tags among the HTML tags to obtain a tag score of the web page; and then take the sum of the scores as the code quality of the web page.
The determining module 12 of the apparatus 10 for capturing website data according to the embodiment of the present invention may further be configured to: calculate the average web page quality score according to the following formula: average web page quality score = (code quality score of the home page of the website + sum of the code quality scores of the second-level pages of the website) / (1 + number of second-level pages of the website); and calculate the crawling probability of the website according to the following formula: crawling probability of the website = (maximum value of the set score range - average web page quality score) / maximum value of the set score range.
The capturing module 13 of the apparatus 10 for capturing website data according to the embodiment of the present invention may be further configured to first determine that the capturing probability of the website is not less than a preset lower limit of the capturing probability, and then capture the data of the website.
Fig. 2 is a schematic diagram of a method for crawling website data according to an embodiment of the present invention. As shown in fig. 2, the method is mainly carried out by the apparatus 10 for crawling website data described with reference to fig. 1, and mainly includes steps S20 to S22.
Step S20: acquiring the web page of the website and determining the code quality of the web page. In this step, a web page of the website is first obtained, and then a score is determined for each of one or more of the following approaches:
Determining a redundant code score for the web page using a redundant code inspection tool; redundant code here refers to code segments that are unnecessary in the code of the web page, and it can be detected with tools or plug-ins such as the duplicate code checker Simian, Codestyle or FindBugs to obtain a redundant code score for the web page; for example, it can be set that 1 point is added to the redundant code score of the web page for every n lines of redundant code, where n is greater than or equal to 1.
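A minimal sketch of the example rule above, assuming the number of redundant lines has already been reported by some inspection tool. The function name and parameters are hypothetical and are not taken from the patent or from any of the named tools.

```python
def redundant_code_score(redundant_lines: int, n: int = 1) -> int:
    """Add 1 point for every n lines of redundant code (n >= 1).

    redundant_lines is assumed to come from an external inspection tool;
    this sketch does not detect redundancy itself.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if redundant_lines < 0:
        raise ValueError("redundant_lines must be non-negative")
    # Integer division counts how many complete groups of n lines exist.
    return redundant_lines // n
```

With n = 10, a page with 25 redundant lines would receive 2 points under this reading of the rule.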
Counting repeated keywords to obtain a repetition score of the web page; the meta tag describes a web page, and low-quality web pages generally repeat keywords in the meta description content for search engines to index, so the repetition score of the web page can be determined by counting how many times keywords are repeated in the meta description content; for example, it may be set that each time a keyword is repeated n times, 1 point is added to the repetition score of the web page, where n is greater than or equal to 3.
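The keyword count can be sketched with the Python standard library as follows. This is an illustration under stated assumptions, not the patented implementation: it assumes the relevant text is the content attribute of meta tags named "keywords" or "description", that keywords are single words compared case-insensitively, and that "repeated n times" means n occurrences. The class and function names are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class MetaContentParser(HTMLParser):
    """Collects the content of <meta name="keywords"> and <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() in ("keywords", "description"):
            self.contents.append(attr.get("content") or "")


def repetition_score(html: str, n: int = 3) -> int:
    """Add 1 point each time a keyword reaches another n occurrences (n >= 3)."""
    if n < 3:
        raise ValueError("n must be at least 3")
    parser = MetaContentParser()
    parser.feed(html)
    words = []
    for content in parser.contents:
        # Assumption: a keyword is a single word, compared case-insensitively.
        words.extend(w.lower() for w in re.findall(r"\w+", content))
    return sum(count // n for count in Counter(words).values())
```

A real system would likely need a language-aware tokenizer, since word splitting on `\w+` does not segment Chinese text into keywords.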
Checking the reference library version of the web page to determine a reference library version score for the web page; for example, the version number of the library referenced by the web page can be checked to determine whether it is lower than a set reference version, and if it is lower, 1 point is added to the reference library version score of the web page; if it is not lower, the score is unchanged; at the same time, the referenced version is compared with a pre-stored set of stable versions, and if the referenced version is not one of the stable versions, 1 point is added to the reference library version score of the web page; otherwise, the score is unchanged.
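The two version checks can be sketched as below. The function names, the dotted numeric version format, and the sample version numbers are assumptions made for illustration; the patent does not specify how versions are represented or extracted from the page.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted version such as '1.12.4' into a comparable tuple (1, 12, 4)."""
    return tuple(int(part) for part in text.split("."))


def library_version_score(version: str, minimum_version: str, stable_versions: set) -> int:
    """Score one referenced library.

    +1 if the referenced version is lower than the set reference version,
    +1 if the referenced version is not in the pre-stored set of stable versions.
    """
    score = 0
    if parse_version(version) < parse_version(minimum_version):
        score += 1
    if version not in stable_versions:
        score += 1
    return score
```

Comparing tuples of integers rather than raw strings avoids ordering "1.10.0" before "1.9.0".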
Determining a JavaScript code quality score for the web page using a code inspection tool; code quality here refers to problems present in the code, and the number of problems in the website's JavaScript code can be checked with a code inspection tool such as JSCS to determine the code quality of the web page; for example, if the number of problems found by the code inspection tool exceeds a set upper limit, 1 point is added to the code quality score of the web page; otherwise, the code quality score of the web page is unchanged.
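As a small sketch of this threshold rule, assuming the problem count has already been produced by an external checker (the function name and parameters are hypothetical, and no checker is invoked here):

```python
def javascript_quality_score(issue_count: int, issue_limit: int) -> int:
    """Return 1 if the reported number of problems exceeds the set upper limit, else 0."""
    if issue_count < 0 or issue_limit < 0:
        raise ValueError("counts must be non-negative")
    return 1 if issue_count > issue_limit else 0
```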
Determining a CSS quality score for the web page using a CSS static code inspection tool; the CSS quality of a web page may be determined by examining the tags used in the web page code; for example, the number of uses of tags that CSS practice does not recommend is counted, and each use of <tr></tr> adds 1 point to the CSS quality score of the web page; at the same time, the number of times CSS is written directly in an individual tag is counted, and 1 point is added to the CSS quality score of the web page for every n such occurrences; otherwise, the CSS quality score of the web page is unchanged; where n is greater than or equal to 1.
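One possible reading of this rule, sketched with the standard-library HTML parser. Two points are assumptions rather than statements from the source: "CSS written in an individual tag" is interpreted as an inline `style` attribute, and the set of discouraged tags is a configurable parameter that defaults to the `tr` example given in the text. All names are hypothetical.

```python
from html.parser import HTMLParser


class CssUsageParser(HTMLParser):
    """Counts discouraged tags and inline style attributes in a page."""

    def __init__(self, discouraged_tags):
        super().__init__()
        self.discouraged_tags = set(discouraged_tags)
        self.discouraged_uses = 0
        self.inline_styles = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.discouraged_tags:
            self.discouraged_uses += 1
        # Assumption: CSS "written in an individual tag" means a style attribute.
        if any(name == "style" for name, _ in attrs):
            self.inline_styles += 1


def css_quality_score(html: str, discouraged_tags=("tr",), n: int = 1) -> int:
    """+1 per discouraged tag use, +1 per n inline style attributes (n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    parser = CssUsageParser(discouraged_tags)
    parser.feed(html)
    return parser.discouraged_uses + parser.inline_styles // n
```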
Counting the number of non-recommended tags among the HTML tags to obtain the tag score of the web page; HTML includes many tags that are no longer recommended, so the tag score of the web page can be obtained by comparing the HTML tags of the page with a pre-stored library of non-recommended tags: if a non-recommended tag is used, 1 point is added to the tag score of the web page; otherwise, the tag score of the web page is unchanged.
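The tag comparison can be sketched as follows. The tag set below is a small illustrative sample of elements that are obsolete in current HTML, standing in for the pre-stored library, whose actual contents the patent does not list. Adding one point per occurrence is also an assumption, and the names are hypothetical.

```python
from html.parser import HTMLParser

# Illustrative stand-in for the pre-stored library of non-recommended tags.
NON_RECOMMENDED_TAGS = frozenset(
    {"font", "center", "marquee", "big", "strike", "tt", "frame", "frameset"}
)


class NonRecommendedTagCounter(HTMLParser):
    """Counts start tags that appear in the non-recommended tag library."""

    def __init__(self, tag_library):
        super().__init__()
        self.tag_library = tag_library
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so the library uses lowercase too.
        if tag in self.tag_library:
            self.count += 1


def tag_score(html: str, tag_library=NON_RECOMMENDED_TAGS) -> int:
    """Add 1 point for each use of a non-recommended tag."""
    counter = NonRecommendedTagCounter(tag_library)
    counter.feed(html)
    return counter.count
```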
The sum of the scores is then taken as the code quality of the web page.
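A short sketch of this summation, assuming each applied check has already produced a non-negative integer score. The dictionary keys and the sample values are illustrative labels, not names or data from the patent. Because the method allows "one or more" of the checks, only the scores that were actually computed need to be present.

```python
from typing import Mapping


def page_code_quality(scores: Mapping[str, int]) -> int:
    """Sum the scores of whichever checks were applied to the page."""
    for name, value in scores.items():
        if value < 0:
            raise ValueError(f"score for {name!r} must be non-negative")
    return sum(scores.values())


# Hypothetical scores for a single page.
example_scores = {
    "redundant_code": 2,
    "keyword_repetition": 1,
    "library_version": 1,
    "javascript": 0,
    "css": 3,
    "non_recommended_tags": 4,
}
print(page_code_quality(example_scores))  # prints 11
```

Note that a higher total means more detected problems, that is, lower code quality.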
Step S21: determining the crawling probability of the website according to the code quality of the web page. The web pages referred to in step S20 include the website's home page and a set number of the website's second-level pages. Through step S20, the apparatus 10 for capturing website data determines the code quality of the captured home page and second-level pages of the website. The average web page quality score is then calculated according to the following formula: average web page quality score = (code quality score of the home page of the website + sum of the code quality scores of the second-level pages of the website) / (1 + number of second-level pages of the website). Finally, the crawling probability of the website is calculated according to the following formula: crawling probability of the website = (maximum value of the set score range - average web page quality score) / maximum value of the set score range. For example, with the maximum value of the score range set to 100, if the average web page quality score of the website is 50, the crawling probability of the website is 0.5; if the average web page quality score is 20, the crawling probability is 0.8. That is, the lower the average web page quality score of a website, the higher its crawling probability.
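The two formulas and the worked example can be sketched as follows. The function names are hypothetical, and the default maximum of 100 is taken from the example in the text rather than fixed by the method.

```python
from typing import Sequence


def average_quality_score(home_score: float, second_level_scores: Sequence[float]) -> float:
    """(home page score + sum of second-level page scores) / (1 + number of second-level pages)."""
    return (home_score + sum(second_level_scores)) / (1 + len(second_level_scores))


def crawl_probability(average_score: float, max_score: float = 100.0) -> float:
    """(maximum of the set score range - average score) / maximum of the set score range."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return (max_score - average_score) / max_score


# Home page scored 60, two second-level pages scored 40 and 50.
average = average_quality_score(60, [40, 50])
print(average)                     # prints 50.0
print(crawl_probability(average))  # prints 0.5
```

The text does not say what happens when an average score exceeds the maximum of the set range; in this sketch the result would simply be negative, so a caller may want to clamp it to 0.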
Step S22: capturing the data of the website according to the crawling probability of the website. In this step, the apparatus 10 for capturing website data captures website data based on the crawling probability obtained in step S21. For example, it may be set that a website is not crawled if its crawling probability is less than a set lower limit; with the lower limit set to 0.4, a website whose crawling probability obtained in step S21 is 0.35 is not crawled.
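The threshold decision can be sketched as below, using the lower limit of 0.4 from the example. The function names and the sample site names are hypothetical.

```python
from typing import Dict, List


def should_crawl(probability: float, lower_limit: float = 0.4) -> bool:
    """Crawl only when the probability is not less than the set lower limit."""
    return probability >= lower_limit


def select_sites(probabilities: Dict[str, float], lower_limit: float = 0.4) -> List[str]:
    """Return the sites whose crawling probability passes the lower limit."""
    return [site for site, p in probabilities.items() if should_crawl(p, lower_limit)]


# Hypothetical sites and probabilities.
sites = {"a.example": 0.35, "b.example": 0.8, "c.example": 0.4}
print(select_sites(sites))  # prints ['b.example', 'c.example']
```

A site exactly at the limit is crawled, which matches the "not less than" wording of the optional step.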
According to the technical scheme of the embodiment of the invention, the capturing probability of the website is obtained by analyzing the code quality of the website, so that websites with poor code quality can be filtered, the workload of a web crawler is reduced, the waste of time on websites with low code quality is avoided when a client searches, and the use experience of the user is improved to a certain extent.
Fig. 3 is a schematic diagram of another apparatus for crawling website data according to an embodiment of the present invention. As shown in fig. 3, the apparatus 30 for capturing website data of the present invention mainly includes a memory 31 and a processor 32, wherein the memory 31 stores instructions and the processor 32 executes the instructions to: acquire a web page of a website and determine the code quality of the web page; determine the capturing probability of the website according to the code quality of the web page; and capture data of the website according to the capturing probability of the website. The web pages of the website comprise the home page of the website and a set number of second-level pages of the website.
The processor 32 of the apparatus 30 for crawling website data of the present invention is further configured to: first, determine a score for each of one or more of the following approaches: determining a redundant code score of the web page by using a redundant code inspection tool; counting repeated keywords to obtain a repetition score of the web page; inspecting the reference library version of the web page to determine a reference library version score of the web page; determining a JavaScript code quality score of the web page by using a code inspection tool; determining a CSS quality score of the web page by using a CSS static code inspection tool; and counting the number of non-recommended tags among the HTML tags to obtain a tag score of the web page; and then take the sum of the scores as the code quality of the web page.
The processor 32 of the apparatus 30 for crawling website data of the present invention is further configured to: calculate the average web page quality score according to the following formula: average web page quality score = (code quality score of the home page of the website + sum of the code quality scores of the second-level pages of the website) / (1 + number of second-level pages of the website); and calculate the crawling probability of the website according to the following formula: crawling probability of the website = (maximum value of the set score range - average web page quality score) / maximum value of the set score range.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A method for crawling website data, comprising:
acquiring a webpage of a website, and determining the code quality of the webpage;
determining the capturing probability of the website according to the code quality of the webpage;
capturing data of the website according to the capturing probability of the website;
wherein the step of determining the code quality of the web page comprises:
firstly, determining the corresponding scores of the modes according to one or more of the following modes:
a redundant code check tool is used to determine a redundant code score for the web page,
counting the repeated keywords to obtain the repeated degree score of the webpage,
examining a reference library version of a web page to determine a reference library version score for the web page,
using a code inspection tool to determine a Javascript code quality score for the web page,
the CSS code static check tool is used to determine the CSS quality score for the web page,
comparing the html tag with a pre-stored tag library which is not recommended to use to obtain the tag score of the webpage;
and then taking the sum of the scores as the code quality of the webpage.
2. The method of claim 1,
the web pages of the website comprise a home page of the website and a set number of second-level pages of the website;
the step of determining the crawling probability of the website according to the code quality of the webpage comprises the following steps:
calculating the average web page quality score according to the following formula: average web page quality score = (code quality score of the home page of the website + sum of the code quality scores of the second-level pages of the website) / (1 + number of second-level pages of the website);
calculating the crawling probability of the website according to the following formula: crawling probability of the website = (maximum value of the set score range - average web page quality score) / maximum value of the set score range.
3. The method of claim 1, wherein the step of crawling the data of the website according to the crawling probability of the website comprises:
firstly, determining that the capturing probability of the website is not less than the lower limit value of the preset capturing probability, and then capturing data of the website.
4. An apparatus for crawling website data, comprising:
the acquisition module is used for acquiring a webpage of a website and then determining the code quality of the webpage;
the determining module is used for determining the capturing probability of the website according to the code quality of the webpage;
the grabbing module is used for grabbing the data of the website according to the grabbing probability of the website; wherein the obtaining module is further configured to: firstly, determining the corresponding scores of the modes according to one or more of the following modes: determining a redundant code score of the webpage by using a redundant code inspection tool, counting repeated keywords to obtain a repetition score of the webpage, inspecting a reference library version of the webpage to determine a reference library version score of the webpage, determining a Javascript code quality score of the webpage by using a code inspection tool, determining a CSS quality score of the webpage by using a CSS code static inspection tool, and obtaining a label score of the webpage by comparing an html label with a pre-saved label library which is not recommended to use; and then taking the sum of the scores as the code quality of the webpage.
5. The apparatus of claim 4, wherein the web pages of the website comprise a first page of the website and a set number of second-level pages of the website; the determining module is further configured to:
calculate the average web page quality score according to the following formula: average web page quality score = (code quality score of the first page of the website + sum of the code quality scores of the second-level pages of the website) / (1 + number of second-level pages of the website);
calculate the crawling probability of the website according to the following formula: crawling probability of the website = (maximum value of the set score range - average web page quality score) / maximum value of the set score range.
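In symbols, the two formulas of claim 5 can be written as below. The notation is an assumption introduced for readability and does not appear in the source: $S_0$ is the code quality score of the first page, $S_1, \dots, S_n$ are the scores of the $n$ second-level pages, $M$ is the maximum value of the set score range, and $P$ is the crawling probability. The numbers in the worked line are hypothetical.

```latex
\bar{S} = \frac{S_0 + \sum_{i=1}^{n} S_i}{1 + n},
\qquad
P = \frac{M - \bar{S}}{M}

% Worked example with hypothetical values S_0 = 10, (S_1, S_2, S_3) = (20, 30, 40), M = 100:
\bar{S} = \frac{10 + 20 + 30 + 40}{1 + 3} = 25,
\qquad
P = \frac{100 - 25}{100} = 0.75
```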
6. The apparatus of claim 4, wherein the crawling module is further configured to first determine that the crawling probability of the website is not less than a preset lower limit value of the crawling probability, and then crawl the data of the website.
7. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-3.
8. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-3.
CN201610171622.XA 2016-03-24 2016-03-24 Method and device for capturing website data Active CN107229631B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610171622.XA CN107229631B (en) 2016-03-24 2016-03-24 Method and device for capturing website data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610171622.XA CN107229631B (en) 2016-03-24 2016-03-24 Method and device for capturing website data

Publications (2)

Publication Number Publication Date
CN107229631A (en) 2017-10-03
CN107229631B (en) 2020-11-03

Family

ID=59932133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610171622.XA Active CN107229631B (en) 2016-03-24 2016-03-24 Method and device for capturing website data

Country Status (1)

Country Link
CN (1) CN107229631B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388669A (en) * 2018-03-19 2018-08-10 四川意高汇智科技有限公司 Distributed computing method for data mining
CN114925308B (en) * 2022-04-29 2023-10-03 北京百度网讯科技有限公司 Webpage processing method and device of website, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103116638A (en) * 2013-02-19 2013-05-22 人民搜索网络股份公司 Webpage screening method and device thereof
CN104063310A (en) * 2013-03-22 2014-09-24 阿里巴巴集团控股有限公司 WEB front end quality detection method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102111385A (en) * 2009-12-28 2011-06-29 北京安码科技有限公司 Webpage security trust scoring method
CN103631806A (en) * 2012-08-24 2014-03-12 华为技术有限公司 Network information fetching method and device
CN104133830A (en) * 2013-05-02 2014-11-05 乐视网信息技术(北京)股份有限公司 Data obtaining method
CN103399918B (en) * 2013-07-31 2016-08-17 东北大学 A kind of method improving the searched rate in website

Also Published As

Publication number Publication date
CN107229631A (en) 2017-10-03

Similar Documents

Publication Publication Date Title
Henzinger et al. Measuring index quality using random walks on the Web
CN103279710B (en) Method and system for detecting malicious codes of Internet information system
Chitraa et al. A novel technique for sessions identification in web usage mining preprocessing
US8417657B2 (en) Methods and apparatus for computing graph similarity via sequence similarity
CN102567407B (en) Method and system for collecting forum reply increment
US10621255B2 (en) Identifying equivalent links on a page
Gowda et al. Clustering web pages based on structure and style similarity (application paper)
Meschenmoser et al. Scraping scientific web repositories: challenges and solutions for automated content extraction
CN1834965A (en) Method and system for assessing quality of search engines
CN106230835B (en) Method based on Nginx log analysis and the IPTABLES anti-malicious access forwarded
CN107229631B (en) Method and device for capturing website data
US20150205769A1 (en) System and method for recognizing non-body text in webpage
WO2012129102A2 (en) Detection and analysis of backlink activity
US11108802B2 (en) Method of and system for identifying abnormal site visits
Oza et al. Elimination of noisy information from web pages
CN108574585B (en) System fault solution obtaining method and device
US20190121914A1 (en) Method for Automated Categorization of Keyword Data
KR101524618B1 (en) Apparatus for colleting of harmful sites and method thereof
CN113722572A (en) Distributed deep crawling method, device and medium
Doerfel et al. How social is social tagging?
JP6960274B2 (en) Data collection equipment, data collection methods, and programs
Kapusta et al. Analysis of differences between expected and observed probability of accesses to web pages
Chandra et al. A Study on website quality evaluation based on sitemap
Srivastava et al. Implementation of web application for disease prediction using AI
US9898540B1 (en) Method for automated categorization of keyword data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant