CN110717112A - Method for crawling social network data

Info

Publication number: CN110717112A
Application number: CN201911003989.0A
Authority: CN
Inventors: 陈修远; 潘琪
Original assignee: Shandong Health And Medical Big Data Co Ltd
Current assignee: Shandong Health And Medical Big Data Co Ltd
Priority date: 2019-10-22
Filing date: 2019-10-22
Publication date: 2020-01-21

Abstract

The invention discloses a method for crawling social network data and belongs to the technical field of web crawlers. The method comprises: crawling, with a search engine of an online social network, a first web page of a first network domain, wherein the first web page comprises one or more links to second web pages located in one or more second network domains; and accessing a domain rank for each second network domain, wherein the domain rank is based on one or more domain quality signals associated with that second network domain, and at least one domain quality signal comprises a measure of activation of one or more social plugins of the online social network available on one or more web pages of the second network domain. The method crawls social network data efficiently and conveniently, significantly improves crawler speed, and has good value for popularization and application.

Description

Method for crawling social network data

Technical Field

The invention relates to the technical field of webpage crawlers, and particularly provides a method for crawling social network data.

Background

In the big data era, how to effectively extract and use information from massive volumes of business data has become a difficult problem. General-purpose search engines have certain limitations: for example, the returned results contain many web pages the user does not need, and such engines are ineffective at discovering and retrieving data that is information-dense and has a certain structure.

To address these problems, web crawlers that capture relevant web resources in a targeted manner have emerged. A web crawler is a program that automatically downloads web pages; it selectively visits web pages and related links on the Internet according to a defined crawl target in order to acquire the required information. Focused crawlers do not aim for broad coverage; instead they target web pages related to particular subject matter, preparing data resources for subject-oriented user queries. According to statistics, the number of web pages worldwide is in the hundreds of billions, and the number of URLs pointing to the same web information is growing by orders of magnitude, so web crawler technology faces a huge challenge: given the enormous volume of web information, a crawler can download only a small fraction of web pages in a given time.

Today, with the rapid development of cloud computing and big data, developers expect crawlers to obtain as many high-quality pages as possible per unit of time, thereby improving crawling efficiency.

Disclosure of Invention

In view of the above problems, the technical task of the invention is to provide a method for crawling social network data that crawls such data efficiently and conveniently and thereby significantly improves the speed of the crawler.

In order to achieve the purpose, the invention provides the following technical scheme:

A method of crawling social network data, in which a search engine of an online social network crawls a first web page of a first network domain, the first web page comprising one or more links to second web pages, the one or more second web pages being within one or more second network domains; a domain rank is accessed for each second network domain, the domain rank being based, for each second network domain, on one or more domain quality signals associated with the second network domain, at least one domain quality signal comprising a measure of activation of one or more social plugins of the online social network available on one or more web pages of the second network domain; and one or more second web pages are selected for crawling based at least in part on the domain rank of the second network domain associated with each second web page, each selected second web page being crawled by the search engine of the online social network.

Preferably, the first web page is associated with a first network domain, the first web page including links respectively corresponding to the second web pages.

Preferably, the links to the second web pages are associated with one or more second network domains different from the first network domain of the first web page, and the second network domains of the individual links may differ from one another.

Preferably, each second network domain is external to the online social network.

Preferably, each second network domain may be ranked, the domain ranking of each second network domain based at least on one or more domain quality signals associated with the second network domain.

Preferably, one or more of the domain quality signals may include a measure of activation of one or more social plugins of an online social network associated with one or more web pages of the second network domain.

Preferably, one or more second web pages can be selected for crawling based at least in part on a domain rank of a second network domain associated with the second web pages.

Preferably, the search engine may crawl each selected second web page.

Compared with the prior art, the method for crawling social network data of the invention has the following outstanding beneficial effects: it is a developer-oriented method for capturing social network data with a web crawler, and it crawls social network data efficiently and conveniently, thereby significantly improving crawler speed. It has good value for popularization and application.

Detailed Description

The method for crawling social network data of the present invention will be described in further detail with reference to the following embodiments.

Examples

In the method for crawling social network data of the invention, a search engine of an online social network crawls a first web page of a first network domain, wherein the first web page comprises one or more links to second web pages, and the one or more second web pages are in one or more second network domains. The first web page is associated with the first network domain and includes links respectively corresponding to the second web pages. The links to the second web pages are associated with one or more second network domains that are different from the first network domain of the first web page, and the second network domains of the individual links may differ from one another. Each second network domain is external to the online social network.
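
The link-discovery step described above can be illustrated with a minimal sketch. It assumes the first web page is already available as an HTML string and that a network domain can be approximated by the URL hostname; the names `LinkCollector`, `second_pages_by_domain`, and the `social_network_host` parameter are hypothetical and do not come from the patent.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def second_pages_by_domain(first_page_url, html, social_network_host):
    """Group the outbound links of the first web page by second network domain.

    Assumption: a "network domain" is approximated by the URL hostname.
    Links that stay inside the first network domain, or that point into the
    online social network itself, are skipped because second network domains
    are described as different from the first domain and external to the
    social network.
    """
    first_host = urlparse(first_page_url).hostname
    collector = LinkCollector()
    collector.feed(html)

    grouped = defaultdict(list)
    for href in collector.hrefs:
        url = urljoin(first_page_url, href)  # resolve relative links
        host = urlparse(url).hostname
        if host is None or host in (first_host, social_network_host):
            continue
        grouped[host].append(url)
    return dict(grouped)
```

The result maps each candidate second network domain to the second web pages found under it, which is the shape the later ranking and selection steps need.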

The method for crawling social network data is associated with a social network system, can utilize data of the social network system, and can be operated by one or more servers of the social network system. The one or more servers may include one or more web crawlers.

In a particular embodiment, the first web page is associated with a first network domain. Further, the first web page may include links respectively corresponding to the second web pages. The links to the second web pages may be associated with one or more second network domains different from the first network domain of the first web page, and the second network domain of each second web page may be different. The web crawler may send the second web pages to the social networking system, which may determine whether each second web page should be crawled.

A domain rank is accessed for each second network domain. For each second network domain, the domain rank is based on one or more domain quality signals associated with that domain, and at least one domain quality signal comprises a measure of activation of one or more social plugins of the online social network available on one or more web pages of the second network domain. In other words, each second network domain may be ranked, with its domain rank based at least on one or more domain quality signals associated with it. One or more second web pages may be selected for crawling based at least in part on the domain rank of the second network domain associated with each second web page. The search engine may crawl each selected second web page.
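
The patent does not define how the measure of social plugin activation is computed. As one possible illustration only, the sketch below assumes that a per-page count of plugin activations (for example, clicks on a share or like button) is already available, and takes the mean count per page as the domain-level signal. The function name `plugin_activation_by_domain` and the input format are assumptions.

```python
from urllib.parse import urlparse


def plugin_activation_by_domain(page_activations):
    """Return the mean number of social-plugin activations per page, per domain.

    page_activations maps a page URL to a hypothetical activation count.
    The mean is an assumed aggregation; the source only says the signal is
    "a measure of activation" without specifying one.
    """
    totals = {}
    for url, activations in page_activations.items():
        host = urlparse(url).hostname
        total, pages = totals.get(host, (0, 0))
        totals[host] = (total + activations, pages + 1)
    return {host: total / pages for host, (total, pages) in totals.items()}
```

Other aggregations, such as the share of pages with at least one activation, would fit the description equally well.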

The social networking system may access the domain rank of each second network domain. Furthermore, the domain rank of each network domain (1, ..., k), i.e. each second network domain, can be determined from values (1, ..., k) that correspond to the signals, i.e. the domain quality signals, of each network domain (1, ..., k). For example, the social networking system may access a domain rank for each second network domain associated with the second web pages. One of the signals (1, ..., k) may correspond to one or more social plugins of the social networking system associated with one or more web pages of each network domain (1, ..., k). For each second network domain, the domain rank may be based at least on historical ranking data associated with the second network domain. For example, the social networking system may access historical crawl hit data and use the previous title web pages of each second network domain as input to a machine learning (ML) algorithm that determines the domain rank of the second network domain.
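
To make the relation between signal values and a domain rank concrete, the following sketch combines the values of the domain quality signals into a single score with a weighted sum. The weighted sum and the weights are assumptions chosen for illustration; the description leaves the combination open and mentions an ML algorithm as one way to determine the rank. The name `domain_rank` is hypothetical.

```python
def domain_rank(signal_values, weights):
    """Combine domain quality signal values into one domain rank.

    Assumption: the rank is a weighted sum of the signal values. This is a
    simple stand-in for whatever model (for example, a trained ML model)
    an actual system would use.
    """
    if len(signal_values) != len(weights):
        raise ValueError("one weight is required per signal value")
    return sum(w * s for w, s in zip(weights, signal_values))


# Hypothetical example: three signals, the first being plugin activation.
example_rank = domain_rank([1.0, 0.5, 0.0], [0.5, 0.25, 0.25])
```

With these example numbers the first signal contributes 0.5 and the second 0.125, so domains with strong plugin activation dominate the ranking.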

One or more second web pages are selected for crawling based at least in part on a domain rank of a second network domain associated with the second web pages.

The social networking system may select one or more of the second web pages to crawl based at least in part on the domain rank of the second network domain associated with each second web page. For example, the domain ranks of the second network domains associated with second web page A and second web page B may be above a predetermined threshold. Second web page A and second web page B are therefore associated with high-quality network domains, and the social networking system may select them for further web crawling. Conversely, the domain ranks of the second network domains associated with second web page C and second web page D may be below the predetermined threshold. Second web page C and second web page D are therefore associated with low-quality network domains, and the social networking system may not select them for further web crawling. In particular embodiments, the order in which the second web pages are crawled may be determined by the domain ranks of the second network domains with which they are associated. For example, second web page A may be associated with a higher domain rank than second web page B, and thus second web page A may be crawled before second web page B.
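
The threshold-based selection and rank-based ordering in this paragraph can be sketched as follows. The sketch assumes that domain ranks are stored in a dictionary keyed by hostname and that unknown domains default to a rank of 0.0; the name `select_pages` and these conventions are illustrative assumptions.

```python
from urllib.parse import urlparse


def select_pages(candidate_urls, domain_ranks, threshold):
    """Select pages whose domain rank exceeds the threshold, best first.

    domain_ranks maps a hostname to its domain rank. Pages from domains
    without a known rank are treated as rank 0.0 (an assumption).
    """
    scored = []
    for url in candidate_urls:
        rank = domain_ranks.get(urlparse(url).hostname, 0.0)
        if rank > threshold:
            scored.append((rank, url))
    # Higher domain rank means earlier crawling; ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in scored]
```

For pages A to D with hypothetical domain ranks 0.9, 0.7, 0.3, and 0.1 and a threshold of 0.5, the function keeps A and B and returns A before B, matching the example in the text.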

A search engine of the online social network crawls each selected second web page.

The social networking system may send the selected second web page A and second web page B to the web crawlers for further web crawling. In particular embodiments, a search engine of the social networking system may send one or more selected second web pages to one or more web crawlers, each web crawler receiving one or more selected second web pages from the search engine. Each web crawler may prioritize the selected second web pages received from the search engine for web crawling. For example, a social networking system may include two web crawlers, A and B, associated with two servers respectively. The search engine may send second web page A and second web page B, whose domain ranks are above a predetermined threshold, to web crawler A and web crawler B, respectively, for web crawling. The search engine may also send second web page C and second web page D, whose domain ranks are below the threshold, to web crawler A and web crawler B, respectively, for web crawling. Web crawler A and web crawler B may then crawl second web page A and second web page B in preference to second web page C and second web page D. In particular embodiments, the search engine may index and access the content of each second web page for future retrieval in one or more search results of the search engine.
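
The dispatch and prioritization behaviour of the two-crawler example can be sketched with a priority queue per crawler. Round-robin assignment and the `WebCrawler` and `dispatch` names are assumptions made for illustration, and no pages are actually fetched; the sketch only shows the order in which each crawler would process what it receives.

```python
import heapq
from itertools import count


class WebCrawler:
    """A crawler that processes received pages in order of domain rank."""

    def __init__(self, name):
        self.name = name
        self._queue = []      # min-heap of (-rank, arrival index, url)
        self._tie = count()   # keeps arrival order for equal ranks

    def receive(self, url, rank):
        # Negate the rank so the highest-ranked page is popped first.
        heapq.heappush(self._queue, (-rank, next(self._tie), url))

    def crawl_order(self):
        """Drain the queue and return URLs in the order they would be crawled."""
        order = []
        while self._queue:
            _, _, url = heapq.heappop(self._queue)
            order.append(url)
        return order


def dispatch(pages_with_rank, crawlers):
    """Send (url, rank) pairs to the crawlers in round-robin fashion (assumed)."""
    for index, (url, rank) in enumerate(pages_with_rank):
        crawlers[index % len(crawlers)].receive(url, rank)
```

Even if a crawler receives page C before page A, it crawls page A first because page A belongs to the higher-ranked domain.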

Although the steps of the method for crawling social network data of the present invention are described as occurring in a particular order, the invention contemplates any suitable steps of the method occurring in any suitable order.

The above-described embodiments are merely preferred embodiments of the present invention, and general changes and substitutions by those skilled in the art within the technical scope of the present invention are included in the protection scope of the present invention.

Claims

1. A method of crawling social network data, characterized in that: the method crawls a first web page of a first network domain using a search engine of an online social network, the first web page including one or more links to second web pages, the one or more second web pages being within one or more second network domains; a domain rank is accessed for each second network domain, the domain rank being based, for each second network domain, on one or more domain quality signals associated with the second network domain, at least one domain quality signal comprising a measure of activation of one or more social plugins of the online social network available on one or more web pages of the second network domain; and one or more second web pages are selected for crawling based at least in part on the domain rank of the second network domain associated with each second web page, each selected second web page being crawled by the search engine of the online social network.

2. The method of crawling social networking data of claim 1, wherein: the first web page is associated with a first network domain, the first web page including links respectively corresponding to the second web pages.

3. The method of crawling social networking data of claim 2, wherein: the links to the second web pages are associated with one or more second network domains different from the first network domain of the first web page, and the second network domains of the individual links may differ from one another.

4. The method of crawling social networking data of claim 3, wherein: each second network domain is external to the online social network.

5. The method of crawling social networking data of claim 4, wherein: each second network domain may be ranked, the domain ranking of each second network domain based at least on one or more domain quality signals associated with the second network domain.

6. The method of crawling social networking data of claim 5, wherein: one or more of the domain quality signals may include a measure of activation of one or more social plugins of an online social network associated with one or more web pages of the second network domain.

7. The method of crawling social networking data of claim 6, wherein: one or more second web pages may be selected for crawling based at least in part on a domain rank of a second network domain associated with the second web pages.

8. The method of crawling social networking data of claim 7, wherein: the search engine may crawl each selected second web page.