CN110717112A - Method for crawling social network data - Google Patents

Method for crawling social network data

Publication number
CN110717112A
Authority
CN
China
Prior art keywords
domain
network
web page
crawling
web
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911003989.0A
Other languages
Chinese (zh)
Inventor
陈修远
潘琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Health And Medical Big Data Co Ltd
Original Assignee
Shandong Health And Medical Big Data Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-10-22
Filing date
2019-10-22
Publication date
2020-01-21
Application filed by Shandong Health And Medical Big Data Co Ltd
Priority to CN201911003989.0A
Publication of CN110717112A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9536: Search customisation based on social or collaborative filtering

Abstract

The invention discloses a method for crawling social network data and belongs to the technical field of web crawlers. The method comprises: crawling a first webpage of a first network domain using a search engine of an online social network, wherein the first webpage comprises one or more links to second webpages and the one or more second webpages are in one or more second network domains; and accessing a domain rank for each second network domain, wherein the domain rank of each second network domain is based on one or more domain quality signals associated with that domain, and at least one domain quality signal comprises a measure of one or more social plugins of the online social network available on one or more webpages of the second network domain. The method crawls social network data efficiently and conveniently, thereby significantly improving crawler speed, and has good value for wide adoption and application.

Description

Method for crawling social network data
Technical Field
The invention relates to the technical field of webpage crawlers, and particularly provides a method for crawling social network data.
Background
In the big data era, effectively extracting and using information from massive volumes of business data has become a difficult problem. General-purpose search engines have certain limitations: for example, the returned results contain many web pages that the user does not need, and such engines are ineffective for information-dense, structured data, which they cannot discover and retrieve well.
To address these problems, web crawlers that capture relevant web resources in a targeted way have emerged. A web crawler is a program that automatically downloads web pages; it selectively accesses web pages and related links on the internet according to a defined crawl target in order to acquire the required information. Focused crawlers do not pursue broad coverage; instead, they target web pages related to a particular topic in order to prepare data resources for topic-oriented user queries. According to statistics, the number of web pages worldwide is in the hundreds of billions, and the number of URLs pointing to the same web information is larger by several orders of magnitude, so web crawler technology faces a huge challenge: given the enormous volume of web information, a crawler can download only a small fraction of web pages in a given time.
Today, with the rapid development of cloud computing and big data, developers expect to obtain as many high-quality pages as possible per unit time and thereby improve crawling efficiency.
Disclosure of Invention
The technical task of the invention is to address the above problems by providing a method for crawling social network data that crawls social network data efficiently and conveniently, thereby significantly improving crawler speed.
In order to achieve the purpose, the invention provides the following technical scheme:
A method for crawling social network data, comprising: crawling a first web page of a first network domain using a search engine of an online social network, the first web page comprising one or more links to second web pages, the one or more second web pages being within one or more second network domains; accessing a domain rank for each second network domain, the domain rank of each second network domain being based on one or more domain quality signals associated with that domain, at least one domain quality signal comprising a measure of one or more social plugins of the online social network available on one or more web pages of the second network domain; and selecting one or more second web pages for crawling based at least in part on the domain rank of the second network domain associated with each second web page, each selected second web page being crawled by the search engine of the online social network.
Preferably, the first web page is associated with a first network domain, the first web page including links respectively corresponding to the second web pages.
Preferably, the links to the second web pages are associated with one or more second network domains different from the first network domain of the first web page, and the second network domains of the links may differ from one another.
Preferably, each of the second network domains is external to the online social network.
Preferably, each second network domain may be ranked, the domain ranking of each second network domain based at least on one or more domain quality signals associated with the second network domain.
Preferably, one or more of the domain quality signals may include a measure of activation of one or more social plugins of an online social network associated with one or more web pages of the second network domain.
Preferably, one or more second web pages can be selected for crawling based at least in part on a domain rank of a second network domain associated with the second web pages.
Preferably, the search engine may crawl each selected second web page.
Compared with the prior art, the method for crawling social network data has the following outstanding beneficial effects: it is a developer-oriented method for capturing social network data with a web crawler, and it crawls social network data efficiently and conveniently, thereby significantly improving crawler speed and offering good value for wide adoption and application.
Detailed Description
The method for crawling social network data of the present invention will be described in further detail with reference to the following embodiments.
Examples
The method for crawling social network data of the invention crawls a first web page of a first network domain using a search engine of an online social network, wherein the first web page comprises one or more links to second web pages, and the one or more second web pages are in one or more second network domains. The first web page is associated with the first network domain and includes links respectively corresponding to the second web pages. The links to the second web pages are associated with one or more second network domains that are different from the first network domain of the first web page, and the second network domains of the links may differ from one another. Each second network domain is external to the online social network.
The method for crawling social network data is associated with a social network system, can utilize data of the social network system, and can be operated by one or more servers of the social network system. The one or more servers may include one or more web crawlers.
In a particular embodiment, the first web page is associated with a first network domain. Further, the first web page may include links respectively corresponding to the second web pages. The links to the second web pages may be associated with one or more second network domains different from the first network domain of the first web page. The second network domains of the second web pages may differ from one another. The web crawler may send the second web pages to the social networking system, which may determine whether each second web page should be crawled.
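As an illustration of this step, the following minimal sketch collects the links on a first web page and groups the linked second web pages by their network domain, skipping links that stay within the first network domain. It is only one possible reading of the embodiment: the class and function names (`LinkCollector`, `group_external_links`), the example URLs, and the choice of the URL host as the "network domain" are all assumptions, not details given in the source.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def group_external_links(first_page_url, html):
    """Map each second network domain to the linked second web pages in it.

    Assumption: the "network domain" of a page is the host part of its URL.
    """
    first_domain = urlparse(first_page_url).netloc.lower()
    collector = LinkCollector()
    collector.feed(html)

    by_domain = defaultdict(list)
    for href in collector.hrefs:
        absolute = urljoin(first_page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        domain = parsed.netloc.lower()
        # Keep only http(s) links whose domain differs from the first domain.
        if parsed.scheme in ("http", "https") and domain and domain != first_domain:
            by_domain[domain].append(absolute)
    return dict(by_domain)
```

The resulting mapping is the kind of input a later ranking step would need: one entry per second network domain, each listing the candidate second web pages found on the first web page.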
A domain rank is accessed for each second network domain. For each second network domain, the domain rank is based on one or more domain quality signals associated with that domain, and at least one domain quality signal comprises a measure of one or more social plugins of the online social network available on one or more web pages of the second network domain. Each second network domain may be ranked, the domain rank of each second network domain being based at least on one or more domain quality signals associated with the second network domain. One or more of the domain quality signals may include a measure of activation of one or more social plugins of the online social network associated with one or more web pages of the second network domain. One or more second web pages may be selected for crawling based at least in part on the domain rank of the second network domain associated with the second web pages. The search engine may crawl each selected second web page.
The social networking system may access the domain rank of each second network domain. Furthermore, the domain rank of each network domain (1, ..., k), i.e. each second network domain, can be determined from the values (1, ..., k) corresponding to the signals, i.e. the domain quality signals, of each network domain (1, ..., k). For example, the social networking system may access a domain rank for each second network domain associated with the second web pages. One of the signals (1, ..., k) may correspond to one or more social plugins of the social networking system associated with one or more web pages of each network domain (1, ..., k). For each second network domain, the domain rank may be based at least on historical ranking data associated with the second network domain. For example, the social networking system may access historical crawl hit data for previous web pages of each second network domain and use it as input to a machine learning (ML) algorithm that determines the domain rank of the second network domain.
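The source does not say how the signal values are combined into a domain rank, so the following sketch simply assumes a weighted sum. The signal names (`social_plugin_measure`, `historical_crawl_hits`), the weights, and the function names are hypothetical and chosen only for illustration; an ML model, as mentioned above, could replace the weighted sum.

```python
# Hypothetical signal names and weights; the source does not specify them.
ASSUMED_WEIGHTS = {
    "social_plugin_measure": 0.5,   # e.g. a normalized measure of social plugin use
    "historical_crawl_hits": 0.5,   # e.g. a normalized historical crawl hit rate
}


def domain_rank(signal_values, weights=ASSUMED_WEIGHTS):
    """Combine the domain quality signal values of one domain into one score.

    Assumption: a plain weighted sum; a missing signal counts as 0.0.
    """
    return sum(weights[name] * signal_values.get(name, 0.0) for name in weights)


def rank_domains(signals_by_domain, weights=ASSUMED_WEIGHTS):
    """Return a mapping from each second network domain to its domain rank."""
    return {
        domain: domain_rank(values, weights)
        for domain, values in signals_by_domain.items()
    }
```

A weighted sum keeps each signal's contribution easy to inspect, which may be useful when tuning; it is not claimed to be the ranking function used by the invention.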
One or more second web pages are selected for crawling based at least in part on a domain rank of a second network domain associated with the second web pages.
The social networking system may select one or more of the second web pages to crawl based at least in part on the domain rank of the second network domain associated with each second web page. For example, the domain rank of the second network domain associated with second web page A and second web page B may be above a predetermined threshold. In this case, second web page A and second web page B are associated with a high-quality network domain, so the social networking system may select second web page A and second web page B for further web crawling. Conversely, the domain rank of the second network domain associated with second web page C and second web page D may be below the predetermined threshold. In this case, second web page C and second web page D are associated with low-quality network domains, so the social networking system may not select second web page C and second web page D for further web crawling. In particular embodiments, the order in which the second web pages are crawled may be determined by the domain rank of the second network domain with which each second web page is associated. For example, second web page A may be associated with a higher domain rank than second web page B, and thus second web page A may be crawled before second web page B.
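The threshold selection and rank-based ordering described here can be sketched as follows. The function name `select_pages`, the example URLs, and the treatment of unknown domains (rank 0.0) are assumptions; the source also does not say what happens when a rank equals the threshold, so this sketch selects only ranks strictly above it.

```python
from urllib.parse import urlparse


def select_pages(page_urls, domain_ranks, threshold):
    """Return the pages whose domain rank exceeds the threshold, best first.

    Assumptions: the domain of a page is its URL host, and a domain with no
    known rank is treated as rank 0.0.
    """
    scored = []
    for url in page_urls:
        rank = domain_ranks.get(urlparse(url).netloc.lower(), 0.0)
        if rank > threshold:
            scored.append((rank, url))
    # Higher domain rank is crawled first; the sort is stable, so pages with
    # equal rank keep their input order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [url for _, url in scored]
```

With ranks of 0.9, 0.7, 0.2, and 0.1 for the domains of pages A, B, C, and D and a threshold of 0.5, the function returns A followed by B and leaves out C and D, mirroring the example in the text.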
A search engine of the online social network crawls each selected second web page.
The social networking system may send the selected second web page A and second web page B to the web crawler for further web crawling. In particular embodiments, a search engine of the social networking system may send one or more selected second web pages to one or more web crawlers, so that each web crawler receives one or more selected second web pages from the search engine. Each web crawler may prioritize the selected second web pages received from the search engine for web crawling. For example, a social networking system may include two web crawlers A and B associated with two servers, respectively. The search engine may send second web page A and second web page B, whose domain ranks are higher than a predetermined threshold, to web crawler A and web crawler B, respectively, for web crawling. The search engine can also send second web page C and second web page D, whose domain ranks are lower than the threshold, to web crawler A and web crawler B, respectively, for web crawling. Thereafter, web crawler A and web crawler B may crawl second web page A and second web page B in preference to second web page C and second web page D. In particular embodiments, the search engine may index and access the content of each second web page for future retrieval in one or more search results of the search engine.
Although the steps of the method for crawling social network data are described as occurring in a particular order, the invention contemplates any suitable steps of the method occurring in any suitable order.
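The two-crawler example can be sketched with a per-crawler priority queue. The `Crawler` class, the `dispatch` function, and the round-robin assignment of pages to crawlers are assumptions made for illustration; the source only states that pages are sent to the crawlers and that each crawler prioritizes higher-ranked pages.

```python
import heapq
from itertools import count


class Crawler:
    """A crawler that works through its received pages in domain-rank order."""

    def __init__(self, name):
        self.name = name
        self._queue = []
        self._arrival = count()  # tie-breaker: earlier arrivals first

    def receive(self, url, rank):
        # heapq is a min-heap, so the rank is negated to pop the highest first.
        heapq.heappush(self._queue, (-rank, next(self._arrival), url))

    def crawl_order(self):
        """Drain the queue and return the URLs in the order they would be crawled."""
        order = []
        while self._queue:
            _, _, url = heapq.heappop(self._queue)
            order.append(url)
        return order


def dispatch(pages_with_ranks, crawlers):
    """Send (url, rank) pairs to the crawlers in round-robin fashion (assumed)."""
    for index, (url, rank) in enumerate(pages_with_ranks):
        crawlers[index % len(crawlers)].receive(url, rank)
```

If pages C, D, A, and B arrive in that order with ranks 0.2, 0.1, 0.9, and 0.7, crawler A receives C and A and crawls A first, while crawler B receives D and B and crawls B first, which matches the preference described in the example.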
The above-described embodiments are merely preferred embodiments of the present invention, and general changes and substitutions by those skilled in the art within the technical scope of the present invention are included in the protection scope of the present invention.

Claims (8)

1. A method of crawling social network data, characterized in that the method comprises: crawling a first web page of a first network domain using a search engine of an online social network, the first web page including one or more links to second web pages, the one or more second web pages being within one or more second network domains; accessing a domain rank for each second network domain, the domain rank of each second network domain being based on one or more domain quality signals associated with that domain, at least one domain quality signal comprising a measure of one or more social plugins of the online social network available on one or more web pages of the second network domain; and selecting one or more second web pages for crawling based at least in part on the domain rank of the second network domain associated with each second web page, each selected second web page being crawled by the search engine of the online social network.
2. The method of crawling social network data of claim 1, wherein: the first web page is associated with a first network domain, the first web page including links respectively corresponding to the second web pages.
3. The method of crawling social network data of claim 2, wherein: the links to the second web pages are associated with one or more second network domains different from the first network domain of the first web page, and the second network domains of the links may differ from one another.
4. The method of crawling social network data of claim 3, wherein: each second network domain is external to the online social network.
5. The method of crawling social network data of claim 4, wherein: each second network domain may be ranked, the domain rank of each second network domain being based at least on one or more domain quality signals associated with the second network domain.
6. The method of crawling social network data of claim 5, wherein: one or more of the domain quality signals may include a measure of activation of one or more social plugins of the online social network associated with one or more web pages of the second network domain.
7. The method of crawling social network data of claim 6, wherein: one or more second web pages may be selected for crawling based at least in part on the domain rank of the second network domain associated with the second web pages.
8. The method of crawling social network data of claim 7, wherein: the search engine may crawl each selected second web page.
CN201911003989.0A 2019-10-22 2019-10-22 Method for crawling social network data Pending CN110717112A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911003989.0A CN110717112A (en) 2019-10-22 2019-10-22 Method for crawling social network data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911003989.0A CN110717112A (en) 2019-10-22 2019-10-22 Method for crawling social network data

Publications (1)

Publication Number Publication Date
CN110717112A (en) 2020-01-21

Family

ID=69213037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911003989.0A Pending CN110717112A (en) 2019-10-22 2019-10-22 Method for crawling social network data

Country Status (1)

Country Link
CN (1) CN110717112A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160125082A1 (en) * 2014-11-05 2016-05-05 Facebook, Inc. Social-Based Optimization of Web Crawling for Online Social Networks
CN108376146A (en) * 2017-01-30 2018-08-07 Apple Inc. Influence scoring based on domain

Similar Documents

Publication Publication Date Title
Sharma et al. A brief review on search engine optimization
US11580168B2 (en) Method and system for providing context based query suggestions
US11599499B1 (en) Third-party indexable text
KR101225467B1 (en) Propagating useful information among related web pages, such as web pages of a website
US7958111B2 (en) Ranking documents
US8666990B2 (en) System and method for determining authority ranking for contemporaneous content
EP1517250A1 (en) Improved systems and methods for ranking documents based upon structurally interrelated information
WO2009066140A2 (en) Federated search implemented across multiple search engines
US8438149B1 (en) Generating network pages for search engines
US20120143844A1 (en) Multi-level coverage for crawling selection
EP3022666A1 (en) Third party search applications for a search system
CN106776983B (en) Search engine optimization device and method
CN102541924B (en) A kind of caching method of retrieving information and search engine system
Alhaidari et al. User preference based weighted page ranking algorithm
WO2017003893A1 (en) Automatic grouping of browser bookmarks
CN102981903A (en) Method for process multiplexing in multi-core browser and multi-core browser of process multiplexing
US20210075809A1 (en) Method of and system for identifying abnormal site visits
Lyu et al. An efficient and packing-resilient two-phase android cloned application detection approach
CN110717112A (en) Method for crawling social network data
Chandra et al. Google search algorithm updates against web spam
Setayesh et al. Presentation of an Extended Version of the PageRank Algorithm to Rank Web Pages Inspired by Ant Colony Algorithm
KR101717063B1 (en) Web crawling apparatus and method
Wang et al. Anti-Crawler strategy and distributed crawler based on Hadoop
Somboonviwat et al. Simulation study of language specific web crawling
CN110011918B (en) Router-cooperation website security detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination