CN113656673A - Master-slave distributed content crawling robot for advertisement delivery - Google Patents


Info

Publication number
CN113656673A
CN113656673A (application CN202110971084.3A)
Authority
CN
China
Prior art keywords
url
scheduling
rule
state
thread
Prior art date
Legal status
Pending
Application number
CN202110971084.3A
Other languages
Chinese (zh)
Inventor
刘文平
Current Assignee
Individual
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority claimed from application CN202110971084.3A
Publication of CN113656673A
Legal status: Pending

Classifications

    • G06F16/951: Information retrieval; Retrieval from the web; Indexing; Web crawling techniques
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F18/214: Pattern recognition; Analysing; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06Q30/0277: Commerce; Marketing; Advertisements; Online advertisement


Abstract

A Redis-based distributed content crawling robot is designed and implemented according to the actual capture requirements of advertisement delivery, with capture and storage carried out in a distributed network deployment. First, to meet the need for classified collection of an advertisement-delivery training set and test set, a multi-threaded distributed content crawling robot is provided: tag-based classified collection rules for multiple sites are designed, a scheduling core with controllable task rate and balanced distribution together with multi-thread-pool concurrent capture is realized on the basis of a multi-task concurrent scheduling strategy, and Web-based rule configuration and system state monitoring are implemented. Second, a deployment scheme for the distributed content crawling robot is given, and an integrated test and an evaluation of the capture results are performed. The distributed architecture designed by the invention greatly improves the efficiency of the content crawling robot, stores the targeted data effectively, meets the actual demands of advertisement delivery, and has important practical significance and great application value.

Description

Master-slave distributed content crawling robot for advertisement delivery
Technical Field
The invention relates to a master-slave distributed content crawling robot, in particular to a master-slave distributed content crawling robot for advertisement delivery, and belongs to the technical field of content crawling robots.
Background
With the rapid development of network technology, and especially the arrival of the mobile internet, the amount of data and information on the network has grown enormously, and the marketing significance and advertising value of the internet have become increasingly prominent. Compared with traditional media, internet advertising has unique advantages. First, the display materials are rich: multi-dimensional elements such as sound, images and text can be fused organically, and the price of an internet advertisement is far lower than that of a traditional media advertisement with the same effect. Second, online advertising is highly interactive; it shortens the distance between advertiser and user while making it easy to measure the delivery effect. In addition, advertisement content can be strongly targeted, the reach is wide, and there are no limits of time and space, all of which give network advertising unique advantages.
The evolution of internet advertising has broadly gone through three stages, and delivery driven by user behaviour has become increasingly important: a delivery system collects user behaviour information and, by means such as feature analysis, recommends the advertisements a user is most interested in. However, no matter how the delivery mode evolves, it cannot do without basic work such as mass data collection and index updating by an efficient web content crawling system. Web content crawling, the core component of a search engine, is naturally an indispensable part of an advertisement delivery system, and the efficiency of the crawling system directly determines the performance of the whole delivery system. Large organizations such as Google have developed web content crawling architectures, but these solutions only provide users with simple, non-customizable search interfaces.
In the prior art there are many open-source web content crawling projects, but a large proportion of them are centralized systems whose collection efficiency and scale can hardly satisfy real application requirements in the face of today's explosive growth of mass data. Moreover, the operating mode of a centralized content crawling system is easily limited by hardware such as memory and processors and by bandwidth resources, and once a failure occurs the whole system is paralysed. Distributed web content crawling exploits the hardware and network resources of multiple machines; compared with a centralized crawling system it has obvious advantages in speed and scale and readily overcomes the bottleneck imposed by the resources of a single system. However, most open-source crawling programs are general-purpose systems with poor customizability that cannot meet the actual needs of advertisement delivery, so researching and developing a distributed content crawling system oriented to advertisement delivery has important practical significance and great application value.
Googlebot is the basic crawler providing the search service for Google. It generates a seed URL list from previously captured data and site maps submitted by webmasters, extracts page links from the seeds into a capture queue, updates old links, removes dead links and adds new links to the Google indexer. Googlebot nevertheless faces a great challenge: it must capture thousands of web pages simultaneously and continuously detect page updates, automatically determining update intervals while distinguishing new content from the pages already stored. The update strategy is of great importance for an excellent content crawling robot system, because repeatedly crawling unchanged web pages severely degrades the crawler's efficiency.
In summary, the content crawling systems for advertisement delivery in the prior art have shortcomings, and the difficulties and problems to be solved by the present invention mainly focus on the following aspects:
firstly, most prior-art web content crawling systems are general-purpose crawlers with poor customizability for advertisement delivery: they cannot work with an internet advertisement delivery system to process web page information in a targeted way, cannot index and manage web page information according to delivery requirements, lack the speed and scale advantages needed by an advertisement-oriented crawler, provide users only with simple, non-customizable search interfaces, lack extensibility, and therefore cannot meet the actual requirements of advertisement delivery;
secondly, a large proportion of the open-source web content crawling projects in the prior art are centralized systems whose collection efficiency and scale can hardly meet the practical requirements of advertisement delivery in the face of today's explosive mass data. In addition, the operating mode of a centralized crawling system is easily limited by hardware such as memory and processors and by bandwidth resources, and once a failure occurs the whole system is paralysed, so the bottleneck caused by system resources cannot be resolved;
third, the prior art lacks a distributed content crawling system designed around content relevance for advertisement delivery: it cannot collect the pages where advertisements are to be delivered by category and use them as a training set, does not provide a cache of the web page information to be served with advertisements, lacks real-time page analysis capability for the delivery system, and lacks a lightweight master-slave distributed crawling architecture, a balanced URL scheduling policy under multiple tasks, a concurrent capture policy for the crawlers and an update policy for existing web page information. The crawled web page information cannot be analysed, processed and stored in a distributed manner; prior-art crawling systems are complex and hard to deploy and can hardly provide data support for a whole advertisement delivery platform.
With the continuous growth of network scale, the rapid spread of mobile networks and the constant emergence of new information, the amount of information on the internet has become enormous and is updated frequently, which makes designing and implementing a content crawling program that satisfies advertising demand extremely challenging: not only do new web pages appear continuously, but existing web pages are frequently updated, and the pages updated every week account for a large share of the whole web. Distributed web crawling is an innovative combination of distributed system design and web content crawling: when multiple web crawlers are distributed over different address segments or geographical locations of a network and work cooperatively, they form a distributed crawler. Tasks are scheduled in units and distributed to different nodes to be captured in parallel, and each node can make full use of its own hardware and network resources to complete its crawling task.
Disclosure of Invention
In order to solve the above problems, the present invention designs and implements a Redis-based distributed content crawling robot according to the actual capture requirements of advertisement delivery, with capture and storage carried out in a distributed network deployment, specifically including: first, to meet the need for classified collection of an advertisement-delivery training set and test set, a multi-threaded distributed web content crawling robot is provided; tag-based classified collection rules for multiple sites are designed, a scheduling core with controllable task rate and balanced distribution together with multi-thread-pool concurrent capture is realized on the basis of a multi-task concurrent scheduling strategy, and Web-based rule configuration and system state monitoring are implemented. Second, a deployment scheme for the distributed content crawling robot is given, and an integrated test and an evaluation of the capture results are performed. Fifteen capture rules were formulated under the distributed deployment scheme of the invention for five categories on three portal websites, and a simple analysis of the capture results shows that the distributed architecture designed by the invention can greatly improve the efficiency of the content crawling robot, store the targeted data effectively, meet the actual demands of advertisement delivery, and has important practical significance and great application value.
In order to realize the technical characteristics, the technical scheme adopted by the invention is as follows:
the master-slave distributed content crawling robot for advertisement delivery is designed to realize a distributed content crawling robot based on Redis, and is used for capturing and storing the content in a distributed network deployment mode; the overall architecture of the distributed content crawling robot comprises a hub scheduler, a plurality of crawler nodes, a Web configuration management module, an agent pool module, a distributed storage module, a Redis database and a MongoDB database, and specifically comprises the following steps:
firstly, based on the actual demand of advertisement delivery, a master-slave distributed content crawling robot is provided to index web page information and update it periodically; a training set and a test set are constructed, realizing quick response for advertisement delivery pages;
secondly, the content crawling robot consists of a central scheduler and several crawler nodes. Crawl rule configuration and real-time monitoring of the running state are realized through a Web interface, and a Redis in-memory database provides two-way communication between the central scheduler and the crawler nodes. The central scheduler uses multi-level URL queues to perform URL rule matching and de-duplication, periodically monitors rule changes and reschedules accordingly, adjusts the scheduling rate of each rule queue according to the configuration, and balances tasks among the crawler nodes through a consistent hashing algorithm. In addition, each crawler requests URLs concurrently with a multi-thread-pool design: page links are extracted first, metadata and body text are extracted with the open-source Goose module, content is stored distributedly with a sharding and replica-set mechanism, and proxy IPs are used to prevent websites from blocking the content crawling robot.
The basic operation flow of the master-slave distributed content crawling robot for advertisement delivery is as follows. First the MongoDB distributed database and the Redis database are started and confirmed to run normally; then the Web configuration management module (Flask) is started to listen on local port 5000, and the capture rules specified for particular sites are configured and stored in Redis one by one. Next the central hub scheduler is started: the scheduler first loads the system configuration file into the global scope and starts a rule-update thread, which reads the pre-established capture rules from Redis into a global dictionary and updates the rule version information; this thread creates and starts the scheduling thread and periodically monitors the rule version number, activating a scheduling flag to notify the scheduling thread to start a new round immediately whenever a rule change is found. Each scheduling round of the scheduling thread consists of two processes: rule initial scheduling and balanced scheduling of the rule queues. Initial scheduling is driven by the rule seed list and performs one update round when the update period is reached; rule-queue scheduling determines the scheduling rate of each queue from the priority and weight configured for it, then performs de-duplication and state updates, and a consistent-hash procedure places the URLs into the scheduling queues of the currently surviving crawlers. After the proxy pool starts, it periodically obtains real-time proxy IP information from the internet, checks its validity and puts valid proxy IPs into Redis. After a crawler starts, it loads several thread units, takes URLs from its scheduling queue and hands them to the download component; after downloading, URLs are extracted and fed back to the scheduler while the pages are put into a data queue to await the storage module. Finally the storage module is started; it preprocesses the pages, extracts the effective information and stores it into the distributed MongoDB database, and the running state of the system is monitored through the Flask Web background. The components are independent of each other and can be deployed on different machine nodes, making effective use of resources.
The master-slave distributed content crawling robot for advertisement delivery further comprises a central hub scheduler: the hub scheduling class consists of a rule-update thread class and a rule-scheduling thread class; these three classes work together and depend on an environment class, which provides a global running dictionary variable storing the real-time capture rules and other shared global variables, and also provides the Redis connection-pool variable. The work class defines two static methods:
the first method comprises the following steps: the link detection method carries out link state scheduling logic, executes actual scheduling operation simultaneously, creates a new URL object class for the link which does not appear, judges the current state of the existing URL object, counts the times of scheduling participation and scheduling failure according to a scheduling record table if the current state is a crawling failure state, abandons the capturing of the URL if the times exceed a set value, otherwise gives the opportunity of participating in scheduling again when the scheduling time interval is met, if the current state is not a failure, abandons the scheduling if the current state is crawling or scheduled, if the current state is crawling success and the type of the URL is a branch, the URL is scheduled immediately based on the condition that the content of the branch node changes fast, and schedules under the condition that the time interval is met under all the remaining conditions;
the second method comprises the following steps: and simultaneously updating the URL object information based on the hash duplicate removal function.
In the master-slave distributed content crawling robot for advertisement delivery, the data items in a URL object include the creation time, the name and type of the rule, and the timestamp of the most recent scheduling. The scheduling state is both the initial state after a URL is created and the state at the start of a new scheduling round; when a URL changes to the scheduled state it has entered the scheduling queue of some crawler and its scheduling timestamp is updated. The state changes to crawling when the URL is taken out by a crawler and downloading starts; if crawling succeeds the state becomes crawl-success, otherwise it becomes crawl-failure and the URL is recorded in the crawl-failure set for later processing;
a private interface, HashRing, is designed using consistent hashing to guarantee the balance and dispersion of task allocation and the monotonicity of the system under extreme conditions. The rule-scheduling thread inherits this interface to perform its task: before each scheduling round the currently surviving crawler nodes are obtained from the Redis database and added to a survival list, which is compared with the previous survival list to produce two vectors: an increase vector containing the newly joined nodes and a decrease vector containing the nodes that have died. The hash-ring state is updated through the interface operations, each rule is assigned to the corresponding crawler node list according to its domain name, and finally the crawler node list is traversed and the URLs in each rule queue are put into the crawler queues at the specified scheduling rate;
the rule scheduling rate is determined by the global weight, the scheduling-queue size limit, the number of rules on a crawler, and the weight and priority parameters specified for each rule; the rule weight and priority are directly proportional to the scheduling rate, the priority value ranges from 1 to 10, and the smaller the value the higher the priority;
during formal scheduling, the rules belonging to the same crawler compute the total scheduling rate by summation, and the space available for scheduling is computed as the difference between the scheduling-queue size limit and the current queue length. Each rule determines, as a percentage, how much of the available space it may use. If the number of URLs in the current rule queue exceeds the space available to it, only that many URLs are scheduled; if it is smaller, the remaining space is carried over and accumulated for the next rule, until all rules have been scheduled.
The master-slave distributed content crawling robot for advertisement delivery further provides a fragmented loop-detection method, considering that the rule-update thread indirectly affects the scheduling frequency of the rule-scheduling thread, so that a dormant scheduling thread can still respond promptly to a hot update of the rules: an event flag is shared between the two threads. Normally the scheduling thread goes to sleep after each scheduling round, and the update thread likewise sleeps after each check of the rule version number; however, the scheduling thread's sleep is fragmented into small time slices, and after each short slice it checks whether the event flag has changed before deciding whether to sleep again. When the update thread finds that the rule version has changed, it sets the event flag before sleeping, which notifies the scheduling thread to start the next scheduling round immediately.
In the master-slave distributed content crawling robot for advertisement delivery, the crawler and storage modules are designed as follows: the crawler contains a downloader component, a URL extraction-and-feedback component and a DNS resolution component; the three are executed strictly in sequence and are packaged into a thread unit, while the DNS resolution component is shared among threads and designed as a re-entrant shared function. The crawler completes concurrent capture and analysis by means of a thread pool formed by several cooperating thread units. Each time a capture task is obtained, the DNS resolution segment first consults the local domain-name resolution cache; on a hit it proceeds directly to the downloader component, otherwise a DNS resolution request is issued and the response is awaited. The download segment hands the requested result to the URL extraction-and-feedback flow, the extracted URLs are fed back to the scheduler, and the request result is forwarded to the storage module, completing one full execution flow;
the execution flow of the download component is as follows: acquiring a grabbing target from a task queue, constructing request header information, trying to fully simulate the real behavior of a browser, encapsulating an agent IP and other components, sending an http request, waiting for a server to respond within a limited time, decompressing a response result, performing coding analysis operation, and returning page information in a unified format;
the extraction-and-feedback component first extracts all link information in the page with a regular expression and passes it to a URL normalization method for filtering: links that are not URLs are filtered out, then a uniform lowercase conversion is applied, URL-encoded characters are restored, equivalent URL directory structures are replaced, and default ports are removed. Finally the URL is decomposed into scheme, host address, directory path and request parameters and reassembled into a complete URL. The normalized URL is forwarded to the rule-filtering method, which matches it against the branch-type URL regular-expression list and the node-type URL regular-expression list of the rule to determine the type of the new URL; branch-type filtering has higher priority than node-type. If matching succeeds, the URL, its type and its parent URL form a triple that is fed back to the scheduler.
In the master-slave distributed content crawling robot for advertisement delivery, the storage module further reduces the impact of database I/O on system performance through a multi-thread-pool design, comprising web page content extraction and distributed data storage: the open-source plug-in python-Goose is used to extract, via XPath, the title, keywords, description and body text, and a Chinese word-segmentation module is additionally loaded to analyse Chinese content;
horizontal scaling is used to split large data blocks into many small blocks stored distributedly on several independent devices, i.e. shards. Horizontal scaling adopts a distributed storage method based on a shard-cluster mechanism: the shard cluster consists of configuration servers, the shards themselves and the query router; the shard processes actually store the data, provide consistent data access to the outside, and use a master-slave synchronized data backup mechanism.
The master-slave distributed content crawling robot for advertisement delivery further comprises a proxy module: the proxy module provides the crawlers with proxy IPs that are usable in real time, to avoid IPs being blocked when several crawler nodes inside the local area network crawl too fast or too much. It periodically captures designated proxy websites to obtain basic IP address and port information, then checks the validity and speed of each proxy with a set of pre-specified verification information, and adds proxies that meet the requirements to the Redis proxy pool;
the complete proxy process: the client first establishes a connection with the designated proxy server and performs protocol-specific parameter handling according to the proxy protocol in use; the proxy then requests a connection to the target server and obtains the data. The proxy server acts as a cache, storing the file data requested by the client locally, so that on the next request the proxy no longer contacts the target server but returns the result directly, which greatly improves the response speed and reduces network traffic.
The master-slave distributed content crawling robot for advertisement delivery further implements the central hub scheduler: the scheduler is a Redis-based multi-threaded URL control and management module covering URL de-duplication, rate scheduling and balanced URL distribution, and includes a function to obtain the top-level domain of a URL (get_top_domain), a URL state update function (update_link_state) and a URL scheduling check function (check_link);
the scheduler obtains the top-level domain as follows: taking any URL as the parameter, it returns the URL's top-level domain. A dictionary of top-level domain suffixes is first defined as the matching standard, and a regular expression regx is defined that matches strings consisting of dot-separated labels ending with one of the dictionary values; the expression is pre-compiled in case-insensitive mode. The host field of the URL is parsed with the urlparse module and then matched; if the match succeeds the captured group is returned, otherwise a null value is returned;
the scheduler creates and updates URL state information: the inputs are a LINK object returned by a crawler and a status string; the LINK object contains the URL value, the parent URL value and the URL type. The MD5 digest of the URL is computed first and the URL is looked up in the Redis de-duplication hash table. If it is not found, a new URL information object is created containing the creation time, trace information, time of the last state change, time of the last change to the scheduled state, current state, state-change list and URL type; it is serialized with JSON and added to the Redis de-duplication hash table. If an information object for the URL is found, the current state value and the state-change list are modified and written back to the Redis de-duplication hash table;
scheduler URL scheduling check: similarly, inputting a LINK object returned by a crawler and a rule dictionary xspider corresponding to the LINK object as parameters, acquiring a URL information object from a duplicate hash table and judging the current state of the URL object, counting the frequency of failure states in a state change list if the state is failure, comparing the frequency with a preset opportunity value, giving the next opportunity or giving up the URL information object, giving up the URL information object if the state is scheduled or crawling if the state is not failure, and giving scheduling opportunities if the state is finished and the type is a branch if the state is finished and the scheduling interval is modified to be 0, wherein the scheduling opportunities are allowed to be given according to the condition that the scheduling interval is met or not if the state is not failed, and the uniform distribution of the URL among nodes of the crawler is realized by distributing domain names through a consistent hash ring;
after these global functions come the rule-update thread and the rule-scheduling thread, and the consistent-hash distribution is implemented as follows: a hash-ring object is defined outside the loop, and in each iteration three lists are defined: the current survival list (now_alive), the increase vector (increase) and the decrease vector (decrease). On the first iteration a previous survival list (last_alive) is created in the running dictionary. At the start of each iteration the currently surviving nodes are read from Redis into now_alive and compared with last_alive in the running dictionary to form the increase and decrease vectors.
Implementation of the crawler in the master-slave distributed content crawling robot for advertisement delivery: the crawler realizes a multi-thread-pool structure with the help of the Redis queue model. Three capture threads are created in a loop and set to daemon state with setDaemon before being started; daemon threads have no independent right to survive, i.e. the process terminates when only daemon threads remain;
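A minimal Python sketch of this thread-pool start-up follows (crawl_loop is a hypothetical placeholder for the fetch/extract/feedback loop of one thread unit; three threads mirrors the count stated above):

```python
import threading

def crawl_loop(worker_id):
    """Placeholder for one thread unit: take a URL, download, extract, feed back."""
    ...

threads = []
for i in range(3):                      # three capture threads, as described above
    t = threading.Thread(target=crawl_loop, args=(i,), name=f"crawler-{i}")
    t.setDaemon(True)                   # daemon threads die with the main process
    t.start()                           # (equivalently t.daemon = True in current Python)
    threads.append(t)
```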
each thread class in the crawler comprises downloading and preprocessing of pages, URL extraction and normalization filtering and page content feedback;
page downloading and preprocessing: the host address field of the URL is first parsed out with a URL-parsing module, a cookie-handling object cookie_handler is wrapped, then a proxy IP is fetched at random from Redis and, if one exists, a proxy-handling object proxy is wrapped; finally an opener object is assembled for sending requests. The constructed request header information headers contains: the Referer field, which marks the page where the requested link was found and guards against hot-linking and malicious crawling programs; the Host field, which names the requested host and solves the problem of identifying virtual hosts behind the same IP; and the User-Agent field, which declares the client identity and client system version;
the method comprises the steps that a Request object Request is constructed by headers head information and a URL, a Request object is opened through an opener to read a page file, if the page file is normally obtained, the page compression condition is firstly known through response head information to conduct GZIP decompression, then the problem of a unified coding format is solved, a chardet module automatically judges that page codes are uniformly converted into utf-8, decoding processing is conducted on GB2312 and GBK through the Chinese codes in a unified mode through GB18030, URL extraction and normalization filtering adopt a rule expression to extract URLs, normalization processing is conducted through a preset rule, finally the eligible URLs are filtered through the rule expression and fed back to a scheduler, and a rule expression for extracting all URLs in the page is constructed.
Compared with the prior art, the invention has the following contributions and innovation points:
firstly, for the early-stage web page classification work in advertisement delivery, the invention provides a distributed web content crawling robot system that can capture specific classified information from multiple sites in a directed way. The system can capture multiple tasks simultaneously, control the scheduling rate of each task, balance the tasks among several crawler nodes, perform preliminary information extraction on the captured pages and store them distributedly, and periodically update the captured content. For advertisement delivery it can capture thousands of pages at the same time, continuously detect page updates while distinguishing them from the stored pages, determine update intervals automatically and apply an update strategy oriented specifically to advertisement delivery, avoiding repeated capture of unchanged pages. The system architecture is simple and the modules are loosely coupled, which makes it easy to extend the system later to provide real-time page classification and parsing information to an advertisement delivery platform. It has good platform compatibility and every module is easy to deploy and configure; it helps an internet advertisement delivery system process web page information, index and manage it efficiently, and helps integrate advertisements into the network to create more additional value;
secondly, aiming at advertisement delivery requirements based on content relevance, the invention designs a master-slave distributed content crawling robot oriented to advertisement delivery. Pages where advertisements are to be delivered are collected by category and used as a training set; a cache of the web page information awaiting advertisement delivery is provided, with real-time analysis capability that can be extended to the delivery platform. A lightweight master-slave distributed crawling architecture is developed and realized, together with a balanced URL scheduling strategy under multiple tasks, a concurrent capture strategy for the crawlers and an update strategy for existing web page information; the crawled page information is analysed and stored distributedly. At the same time the system remains simple and easy to deploy while meeting the requirements: the lightweight Python-based master-slave distributed content crawling robot is designed and realized around the actual requirements of an advertisement delivery system and provides data support for the whole delivery platform. It has obvious speed and scale advantages, readily overcomes the bottleneck caused by system resources, is well customizable for advertisement delivery, can meet the actual requirements of advertisement delivery, and has important practical significance and huge application value;
thirdly, in the master-slave distributed content crawling robot for advertisement delivery proposed by the invention, the collection efficiency and scale meet practical application requirements; the robot is not limited by hardware such as memory and processors or by bandwidth resources, a local failure does not bring down the whole system, and the bottleneck caused by system resources is effectively solved. The invention combines the design ideas of distributed systems with web content crawling: multiple web crawlers distributed over different address segments or geographical locations of the network work cooperatively to form a distributed crawler, tasks are scheduled in units and distributed to different nodes for parallel capture, and each node makes full use of its own hardware and network resources to complete its crawling task.
Drawings
Fig. 1 is a general structural diagram of a master-slave distributed content crawling robot of the present invention.
FIG. 2 is a schematic diagram of the basic operation flow of the master-slave distributed content crawling robot.
FIG. 3 is a diagram of a hub scheduler architecture for a content crawling robot.
FIG. 4 is a URL state transition diagram of the pivot scheduler of the present invention.
FIG. 5 is a schematic diagram of the basic structure of the crawler thread of the present invention.
FIG. 6 is a schematic diagram of the operation scheduling rules of the pivot scheduler.
Fig. 7 is a diagram illustrating a fragmented sleep mode of the pivot scheduler according to the present invention.
FIG. 8 is a graph showing comparison of data amounts collected in time-phased experiments under different nodes.
FIG. 9 is a graph of collected data versus time period for experiments at different nodes.
Detailed description of the invention
The technical solution of the master-slave distributed content crawling robot for advertisement delivery provided by the present invention is further described below with reference to the accompanying drawings, so that those skilled in the art can better understand the present invention and can implement the present invention.
The rapid development of the internet brings about explosive increase of information quantity, how to integrate advertisements into the internet to create extra value becomes very important, the basic work of the internet advertisement delivery system is the processing of webpage information, and the efficient indexing and management of the webpage information are particularly important.
Firstly, based on the actual demand of advertisement putting, the invention provides a master-slave distributed content crawling robot to index webpage information and periodically update, and a training set and a test set are constructed to realize the quick response of advertisement putting pages.
Secondly, the content crawling robot consists of a central scheduler and a plurality of crawling crawlers, the crawling rule configuration and the real-time monitoring of the running state are realized based on a Web mode, a Redis memory database is adopted between the central scheduler and crawler nodes to realize two-way communication, the central scheduler adopts a multi-level URL queue to realize URL rule matching and duplicate removal operation, meanwhile, the regular rescheduling is carried out by periodically monitoring the change of the crawling rules, the scheduling rate of each rule queue is adjusted according to the configuration, and the task balancing is carried out among the plurality of crawling crawlers through a consistent Hash algorithm; in addition, each crawling crawler concurrently requests URLs by adopting a multi-thread pool design, page links are extracted firstly, metadata and text contents are extracted by adopting an open source Goose module, distributed storage is carried out on the contents by adopting a fragment and copy set mechanism, and meanwhile, a proxy IP is used for preventing a website from shielding the contents and crawling a robot.
Overall framework of distributed content crawling robot
Structure and operation design of content crawling robot
The overall structure of the master-slave distributed content crawling robot is shown in fig. 1, and comprises a central hub scheduler, a plurality of crawler nodes, a Web configuration management module, an agent pool module, a distributed storage module, a Redis database and a MongoDB database. The basic operation flow of the master-slave distributed content crawling robot is shown in fig. 2.
Firstly, starting a MongoDB distributed database and a Redis database, confirming normal operation of the MongoDB distributed database and the Redis database, then starting a Web configuration management module flash to monitor a local 5000 port, sequentially configuring and storing a capturing rule specified for a specific site into the Redis database, then starting a central hub scheduler, firstly loading a system configuration file to the global by the scheduler, loading a rule updating thread, reading a pre-established capturing rule from the Redis by the thread to a global dictionary and updating rule version information, creating a scheduling thread by the thread and starting the scheduling thread, periodically and dynamically monitoring a rule version number, activating a scheduling mark to inform the scheduling thread to immediately start new scheduling once when rule change is found, and carrying out two processes in one scheduling process by the scheduling thread: the method comprises the steps of rule initial scheduling and rule queue balanced scheduling, wherein the rule initial scheduling is driven by a rule seed list to perform updating scheduling once according to whether an updating period is reached, the rule queue scheduling determines the scheduling rate of each queue according to the priority and the weight of each queue in the configuration, then the scheduling state is removed and updated, and a consistent Hash process is put into a scheduling queue of a current survival crawler; after the agent pool is started, the agent pool is responsible for periodically obtaining the validity detection of the real-time agent IP information of the Internet and then throwing the effective agent IP information into Redis; after the crawler is started, loading a plurality of thread units, loading URLs from corresponding scheduling queues, delivering the URLs to a downloading component, extracting the URLs after downloading, feeding the URLs back to a scheduler, putting pages into a data queue to wait for processing of a storage module, finally starting the storage module, preprocessing the pages, extracting effective information, storing the effective information into a MongoDB of a distributed database, and monitoring the running state of a system through a FlaskWeb background; the components are independent from each other and can be deployed on different machine nodes, so that the effective utilization of resources is realized.
Design of (II) central hub scheduler
The structure of a central scheduler class diagram is shown in fig. 3, a core scheduling class is composed of a rule updating thread class and a rule scheduling thread class, the three classes work together depending on an environment class, the environment class provides global operation dictionary variables, stores real-time capture rules and some global shared variable information, and also provides Redis database connection pool variables, and the work class defines two static methods, namely: the link detection method carries out link state scheduling logic, executes actual scheduling operation simultaneously, creates a new URL object class for the link which does not appear, judges the current state of the existing URL object, counts the times of scheduling participation and scheduling failure according to a scheduling record table if the current state is a crawling failure state, abandons the capturing of the URL if the times exceed a set value, otherwise gives the opportunity of participating in scheduling again when the scheduling time interval is met, if the current state is not a failure, abandons the scheduling if the current state is crawling or scheduled, if the current state is crawling success and the type of the URL is a branch, the URL is scheduled immediately based on the condition that the content of the branch node changes fast, and schedules under the condition that the time interval is met under all the remaining conditions; the second method comprises the following steps: and simultaneously updating the URL object information based on the hash duplicate removal function.
The URL state transition is shown in figure 4, the scheduling state is not only an initial state after URL creation but also a state when a new schedule starts, the URL changes into a scheduling state mark that the URL enters a scheduling queue of a certain crawler and the scheduling timestamp of the URL object is updated, the state is changed into a crawling state when the URL is taken out by the crawler and starts downloading, the state is changed into crawling success after crawling is successful, otherwise the state is changed into crawling failure, and the URL is recorded to a crawling failure set for subsequent processing.
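As an illustrative Python sketch of the state machine of FIG. 4 combined with the link-check logic described for the hub scheduler (the state names, field names and retry limit are assumptions for illustration, not identifiers from the disclosure):

```python
import time

# Assumed state constants; the text names the states but not their encoding.
SCHEDULING, SCHEDULED, CRAWLING, CRAWL_OK, CRAWL_FAIL = (
    "scheduling", "scheduled", "crawling", "crawl_success", "crawl_failure")

MAX_FAIL_CHANCES = 3   # illustrative stand-in for the "set value" of allowed failures

def new_url_object(rule_name, url_type):
    """Create a URL info object; the field names are assumptions."""
    return {"create_time": time.time(), "rule_name": rule_name, "url_type": url_type,
            "last_scheduled": 0, "state": SCHEDULING, "state_history": []}

def should_schedule(url_obj, schedule_interval):
    """Apply the link-check logic: skip in-flight URLs, retry failures a few times,
    reschedule successful branch pages at once, otherwise respect the interval."""
    state = url_obj["state"]
    if state in (SCHEDULED, CRAWLING):
        return False                                    # already queued or being fetched
    if state == CRAWL_FAIL and url_obj["state_history"].count(CRAWL_FAIL) > MAX_FAIL_CHANCES:
        return False                                    # too many failures: give up this URL
    if state == CRAWL_OK and url_obj["url_type"] == "branch":
        return True                                     # branch content changes fast
    return time.time() - url_obj["last_scheduled"] >= schedule_interval
```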
In order to ensure that tasks are uniformly distributed to all crawlers, a consistent HashRing is used for designing a private interface, balance and dispersity of task distribution are guaranteed, monotonicity of a system under extreme conditions is guaranteed, a rule scheduling thread inherits the interface to realize the task, a current surviving crawler node in the system is obtained through a Redis database before the rule scheduling thread is scheduled each time and added into a survival list, then the survival list is compared with a last survival list, and two vectors are generated through comparison: and adding the added vector containing the nodes which are added currently, adding the subtracted vector containing the nodes which are dead currently, updating the hash ring state through interface operation, adding each rule into a corresponding crawler node list according to the domain name, and finally putting the URL in each rule queue into the crawler queue according to the specified scheduling rate through traversing the crawler node list.
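A minimal sketch of the survival-list comparison and ring update; the HashRing class here is a generic consistent-hash implementation standing in for the private interface, and the Redis key name and virtual-node count are assumptions:

```python
import bisect
import hashlib

class HashRing:
    """A small consistent-hash ring used in place of the private HashRing interface."""
    def __init__(self, replicas=32):
        self.replicas, self._keys, self._map = replicas, [], {}

    def _hash(self, value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.replicas):
            h = self._hash(f"{node}#{i}")
            bisect.insort(self._keys, h)
            self._map[h] = node

    def remove_node(self, node):
        for i in range(self.replicas):
            h = self._hash(f"{node}#{i}")
            self._keys.remove(h)
            del self._map[h]

    def get_node(self, key):
        if not self._keys:
            return None
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._map[self._keys[idx]]

def refresh_ring(ring, rds, running):
    """One round of node bookkeeping, following the increase/decrease vectors above."""
    now_alive = [n.decode() for n in rds.smembers("alive_crawlers")]  # assumed key
    last_alive = running.setdefault("last_alive", [])
    increase = [n for n in now_alive if n not in last_alive]   # newly joined crawlers
    decrease = [n for n in last_alive if n not in now_alive]   # crawlers that died
    for node in increase:
        ring.add_node(node)
    for node in decrease:
        ring.remove_node(node)
    running["last_alive"] = now_alive
    return ring
```

Rules can then be assigned with ring.get_node(get_top_domain(rule_seed_url)), which keeps URLs of the same domain on the same crawler while spreading domains evenly across nodes.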
The rule scheduling rate design is determined by global weight, scheduling queue size limitation, rule quantity in a crawler and weight and priority parameters specified by each rule, wherein the rule weight and the priority are in direct proportion to the scheduling rate, the priority numerical range is 1-10, and the smaller the numerical value is, the higher the priority is.
During formal scheduling, the rules belonging to the same crawler calculate the total scheduling rate in a summing mode, then calculate the space size available for scheduling according to the difference between the scheduling queue size limit and the current number of the scheduling queue, each rule determines how much available space can be used by the rule in a percentage mode, if the number of URLs in the current rule queue is more than the size of the available space in the scheduling, the URLs with the available space number are scheduled, and if the number of URLs in the current rule queue is less than the available space, the residual space is accumulated to the next rule until all the rules are scheduled.
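The per-rule quota computation described above might be sketched as follows; the rate formula (weight divided by the priority value, since a smaller value means higher priority) and the data shapes are illustrative assumptions:

```python
def schedule_rules(rules, queue_limit, queue_len):
    """Distribute the free space of one crawler's scheduling queue among its rules.

    rules: list of dicts like {"name": ..., "weight": w, "priority": p, "pending": [urls]}.
    """
    rates = [r["weight"] / r["priority"] for r in rules]   # assumed rate formula
    total_rate = sum(rates)                      # total scheduling rate by summation
    free_space = max(queue_limit - queue_len, 0) # space still available in the crawler queue

    scheduled, leftover = [], 0
    for rule, rate in zip(rules, rates):
        share = int(free_space * rate / total_rate) + leftover
        pending = rule["pending"]
        if len(pending) > share:
            scheduled.extend(pending[:share])    # queue holds more than its share
            leftover = 0
        else:
            scheduled.extend(pending)            # take everything, carry the rest forward
            leftover = share - len(pending)
    return scheduled
```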
Considering that the rule updating thread indirectly affects the scheduling frequency of the rule scheduling thread, a fragmented loop detection method is provided to ensure that the rule scheduling thread in the dormancy can still respond to the hot update of the rule in time: the method comprises the steps that firstly, event marks are shared between two threads, the rule scheduling thread enters normal sleep after being scheduled each time under normal conditions, the rule updating thread also enters sleep after checking a rule version number each time, but the sleep fragmentation design of the rule scheduling thread is adopted, the sleep time is divided into small time slices, whether the event marks are changed or not is determined after the event marks are temporarily dormant each time to determine whether the next sleep is carried out, and after the rule updating thread finds out that the rule version is changed, the event marks are set before the sleep to inform the rule scheduling thread to immediately carry out the next scheduling.
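A small sketch of the fragmented-sleep mechanism using a shared event flag; Python's threading.Event is assumed as the flag, and the slice length is illustrative:

```python
import threading
import time

rule_changed = threading.Event()   # event flag shared between the two threads

def fragmented_sleep(total, slice_len=1.0):
    """Sleep in small slices; wake early if the rule-update thread set the flag."""
    waited = 0.0
    while waited < total and not rule_changed.is_set():
        time.sleep(min(slice_len, total - waited))
        waited += slice_len
    return rule_changed.is_set()   # True -> start the next scheduling round immediately

# Rule-update thread, after detecting a new rule version number:
#   rule_changed.set()             # notify the scheduling thread before sleeping
# Rule-scheduling thread, after finishing a round (SCHEDULE_PERIOD is an assumed constant):
#   if fragmented_sleep(SCHEDULE_PERIOD):
#       rule_changed.clear()
```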
(III) design of crawler and storage module
The crawler comprises a downloader component, a URL extraction feedback component and a DNS analysis component, wherein the downloader component, the URL extraction feedback component and the DNS analysis component are in a complete sequential execution relationship and are packaged into a thread unit, the DNS analysis component is shared among threads and is designed into a reentrant shared function, the crawler completes concurrent capture analysis work by means of a thread pool formed by cooperation of a plurality of thread units, the basic structure of the thread of the crawler is shown in figure 5, after a capture task is obtained each time, a local domain name analysis cache is obtained through a DNS analysis execution segment, if the capture task is obtained, the local domain name analysis cache directly enters the downloader component, otherwise, DNS analysis is carried out to request for a response result, the download execution segment requests for the result to be delivered to a URL extraction feedback execution stream, the extracted URL is fed back to a scheduler, the request result is forwarded to a storage module, and a complete function execution stream is completed.
The execution flow of the download component is as follows: the method comprises the steps of firstly constructing request header information after acquiring a captured target from a task queue, trying to fully simulate the real behavior of a browser, encapsulating an agent IP and other components, sending an http request, waiting for a server response within a limited time, decompressing a response result, performing coding analysis operation, and returning page information in a unified format.
The extraction feedback component firstly extracts all link information in a page by adopting a rule expression, delivers the link information to a URL (Uniform resource locator) normalization method for filtering, filters links with non-URL properties, then carries out uniform lowercase conversion, restores URL code characters, equivalently replaces a URL directory structure, removes a default port, finally decomposes the URL into a mode type, a host address, a directory path and a request parameter, reassembles the mode type, the host address, the directory path and the request parameter to form a complete URL, forwards the normalized URL to the rule filtering method, respectively carries out rule matching on a branch type URL rule expression list and a node type URL rule expression list in the rule to determine the type of a new URL, wherein the branch type URL filtering priority is higher than the node type URL and has matching priority, and if the matching is successful, the URL, the type of the URL and a father URL of the URL form a triple feedback scheduler.
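The normalization and rule-filtering steps might be sketched as follows (the exact normalization rules and pattern lists are assumptions for illustration):

```python
import re
from urllib.parse import urlsplit, urlunsplit, unquote

def normalize_url(raw):
    """Roughly the normalization steps named above; details are assumptions."""
    url = unquote(raw.strip())                      # restore URL-encoded characters
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        return None                                 # drop links that are not URLs
    host = re.sub(r":80$", "", parts.netloc.lower())    # lowercase, remove default port
    path = re.sub(r"/\./", "/", parts.path) or "/"      # equivalent directory replacement
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

def classify(url, branch_patterns, node_patterns):
    """Branch rules are matched first, mirroring their higher filtering priority."""
    for pat in branch_patterns:
        if re.search(pat, url):
            return "branch"
    for pat in node_patterns:
        if re.search(pat, url):
            return "node"
    return None

def feedback_triples(page_links, parent_url, branch_patterns, node_patterns):
    """Build the (url, type, parent_url) triples fed back to the scheduler."""
    triples = []
    for link in page_links:
        url = normalize_url(link)
        if url is None:
            continue
        url_type = classify(url, branch_patterns, node_patterns)
        if url_type:
            triples.append((url, url_type, parent_url))
    return triples
```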
The storage module adopts a multithreading pool design to reduce the influence of database IO operation on system performance, and the method comprises the following steps: extracting webpage content and storing data in a distributed mode, extracting labels by means of Xpath, title titles, keyword words, description and texts by adopting open source plug-ins python-Goose, and analyzing Chinese content and additionally loading a Chinese word segmentation module.
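Assuming the goose3 fork of python-Goose, pymongo and jieba as the Chinese word-segmentation module (jieba is an assumption; the text does not name one), extraction and storage could look roughly like this, with the collection name taken from the page_info set mentioned below:

```python
from goose3 import Goose                 # assumes the goose3 fork of python-Goose
from pymongo import MongoClient
import jieba                             # assumed Chinese word-segmentation module

client = MongoClient("mongodb://mongos-host:27017")   # placeholder mongos router address
page_info = client["crawler"]["page_info"]            # collection name follows the text

goose = Goose()

def store_page(url, html):
    """Extract title/keywords/description/body with Goose and store them in MongoDB."""
    article = goose.extract(raw_html=html)
    doc = {
        "url": url,
        "title": article.title,
        "keywords": article.meta_keywords,
        "description": article.meta_description,
        "text": article.cleaned_text,
        "tokens": list(jieba.cut(article.cleaned_text)),  # segmented Chinese content
    }
    page_info.insert_one(doc)
```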
Because the memory and hard disk of a single machine have very limited storage resources, a large query volume can exhaust the single machine's CPU, and a large data volume puts heavy storage pressure on it, eventually exhausting the system memory and shifting the pressure to disk I/O. Horizontal scaling solves this problem: a large data block is split into many small blocks stored distributedly on several independent devices (shards). When the data scale grows, it is enough to add more devices or shards; the I/O access pressure is spread evenly over many places and I/O efficiency does not become a bottleneck.
Horizontal scaling uses a distributed storage method based on a shard-cluster mechanism. The shard cluster consists of configuration servers, the shards themselves and the query router; the shard processes that actually store the data provide consistent data access to the outside and use a data backup mechanism with a master-slave synchronization structure.
The query router receives a client request, reads data from the correct shard and returns it to the client; the configuration servers store the meta-information of the document collections, including the mapping from each collection to a specific shard, which the query router relies on to locate the shard server for a user query. Document storage is divided into two collections: a page information collection page_info and an extraction description collection meta_info.
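The two collections can be prepared on a shard cluster roughly as follows; the database name, shard keys and the mongos address are assumptions, and the admin commands are issued through the query router.

```python
from pymongo import MongoClient

# Connect through the query router (mongos); host name is a placeholder.
client = MongoClient("mongodb://mongos-host:27017")

# Enable sharding for the crawler database and shard the two document collections.
client.admin.command("enableSharding", "crawler")
client.admin.command("shardCollection", "crawler.page_info", key={"url_hash": "hashed"})
client.admin.command("shardCollection", "crawler.meta_info", key={"url_hash": "hashed"})

page_info = client["crawler"]["page_info"]   # raw page documents
meta_info = client["crawler"]["meta_info"]   # extracted title / keywords / description
```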
(IV) Design of the proxy module
The proxy module provides real-time available proxy IPs to the crawlers, so that the high crawl rate and volume of many crawl nodes inside the local area network do not get the IP blocked. It periodically fetches configured proxy websites to obtain basic IP address and port information, then checks the validity and speed of each proxy with a set of pre-assigned verification information, and adds proxies that meet the requirements to a Redis proxy pool.
The complete proxy procedure is: the client first connects to the designated proxy server, then performs the protocol-specific parameter handling required by the proxy protocol in use, and then asks the proxy to establish a connection to the target server to obtain the data. The proxy server also acts as a cache: file data requested by the client is cached locally, so on the next request the proxy no longer contacts the target server but returns the result directly, which greatly improves response speed and reduces network traffic.
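A simplified sketch of one proxy-pool refresh cycle; the test URL, the Redis key `proxy_pool` and the timeout are assumptions.

```python
import requests
import redis

r = redis.Redis()

def validate_and_store(candidates, test_url="https://httpbin.org/ip", timeout=5):
    """Probe each (host, port) candidate and publish working proxies to the shared pool."""
    for host, port in candidates:
        proxy = f"http://{host}:{port}"
        try:
            resp = requests.get(test_url,
                                proxies={"http": proxy, "https": proxy},
                                timeout=timeout)
            if resp.ok:                      # proxy answered within the limit: keep it
                r.sadd("proxy_pool", proxy)  # publish it to the Redis proxy pool
        except requests.RequestException:
            continue                         # unreachable or too slow: discard
```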
Second, deployment and implementation of the distributed content crawling robot
(I) Implementation of the hub scheduler
The scheduler is a Redis-based multithreaded URL control and management module covering URL deduplication, rate scheduling and balanced URL distribution. It includes a function for obtaining the top-level domain of a URL (get_top_domain), a URL state update function (update_link_state) and a URL scheduling check function (check_link).
The scheduler obtains the top-level domain: given any URL as parameter, it returns the URL's top-level domain. A dictionary of top-level-domain suffixes is first defined as the matching standard, and a regular expression regx is defined that matches strings consisting of any number of dot-separated segments and ending with any dictionary entry; the expression is pre-compiled in case-insensitive mode. The host field of the URL is parsed with the urlparse module and then matched; on success the captured group value is returned, otherwise a null value.
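A hedged sketch of get_top_domain as described; the suffix dictionary is abbreviated here and would be much larger in practice.

```python
import re
from urllib.parse import urlparse

# Abbreviated top-level-domain suffix dictionary (assumption: a real list is far longer).
TLD_SUFFIXES = ["com", "net", "org", "cn", "com.cn", "gov.cn", "edu.cn"]

# Longest suffixes first so "com.cn" is preferred over "cn"; compiled case-insensitively.
_suffix_alt = "|".join(re.escape(s) for s in sorted(TLD_SUFFIXES, key=len, reverse=True))
_regx = re.compile(r"([^.]+\.(?:%s))$" % _suffix_alt, re.IGNORECASE)

def get_top_domain(url):
    """Return the top-level domain of the URL's host, or None if no suffix matches."""
    host = urlparse(url).netloc.split(":")[0]   # host field without any port
    match = _regx.search(host)
    return match.group(1) if match else None    # captured group on success, otherwise null
```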
The scheduler creates and updates URL state information: the inputs are a LINK object and the status string returned by the crawler, where the LINK object contains the URL value, the parent URL value and the URL type. The MD5 digest of the URL is computed first and the URL is looked up in the Redis deduplication hash table. If it is not found, a new URL information object is created containing the creation time, trace information, time of the last state change, time of the last change to the scheduled state, current state, state-change list and URL type; it is serialized with JSON and added to the Redis deduplication hash table. If an information object for the URL is found, its current state value and state-change list are modified and written back to the hash table.
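The state-update step might look roughly like this in Python; the Redis key `url_seen` and the field names are assumptions that mirror the items listed above.

```python
import hashlib
import json
import time
import redis

r = redis.Redis()

def update_link_state(link, status):
    """Create or update the URL information object in the deduplication hash table."""
    url_hash = hashlib.md5(link["url"].encode("utf-8")).hexdigest()
    raw = r.hget("url_seen", url_hash)
    now = time.time()
    if raw is None:                                  # first time this URL is seen
        info = {
            "create_time": now,
            "trace": link.get("parent_url"),
            "last_change": now,
            "last_scheduled": None,
            "state": status,
            "state_history": [status],
            "url_type": link["url_type"],
        }
    else:                                            # known URL: record the state transition
        info = json.loads(raw)
        info["state"] = status
        info["state_history"].append(status)
        info["last_change"] = now
    r.hset("url_seen", url_hash, json.dumps(info))   # write back to the dedup hash table
```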
Scheduler URL scheduling check: the inputs are likewise a LINK object returned by the crawler and the rule dictionary xspider it belongs to. The URL information object is fetched from the deduplication hash table and its current state is examined. If the state is failed, the number of failure states in the state-change list is counted and compared with a preset retry value to decide whether to give the URL another chance or abandon it. If the state is not failed: a URL that is scheduled or crawling is skipped; a URL whose state is completed and whose type is branch has its scheduling interval set to 0 so that it participates in scheduling immediately; in all remaining cases a scheduling opportunity is granted only if the scheduling interval has elapsed. Balanced distribution of URLs among crawler nodes is achieved by assigning domain names through a consistent hash ring.
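A sketch of the scheduling check following the state machine just described; the state names, the retry limit and the `schedule_interval` field are assumptions.

```python
import time

MAX_RETRIES = 3   # preset retry value (assumption)

def check_link(link, xspider, info):
    """Return True if this URL may enter a crawler queue in the current round."""
    state = info["state"]
    if state == "failed":
        failures = info["state_history"].count("failed")
        return failures < MAX_RETRIES                 # one more chance, or give up
    if state in ("scheduled", "crawling"):
        return False                                  # already in flight: skip
    if state == "done" and link["url_type"] == "branch":
        info["schedule_interval"] = 0                 # branch pages change fast: reschedule now
        return True
    # otherwise honour the configured scheduling interval
    elapsed = time.time() - (info.get("last_scheduled") or 0)
    return elapsed >= xspider.get("schedule_interval", 0)
```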
After the global functions come the rule-update thread and the rule-scheduling thread. Consistent-hash distribution in the rule-update thread is implemented as follows: a hash-ring object is defined outside the loop; in each cycle three lists are defined, the surviving-node list (now_alive), the increase vector (increase) and the decrease vector (decrease); on the first cycle a last-survivors list (last_alive) is created in the running dictionary. At the start of each cycle the currently surviving nodes are read from Redis into now_alive and compared with last_alive in the running dictionary to form the increase and decrease vectors.
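The survivor comparison can be sketched as follows; the `HashRing` methods `add_node`/`remove_node` and the Redis key `alive_crawlers` are assumptions.

```python
import redis

r = redis.Redis()

def refresh_ring(ring, running):
    """Compare current survivors with the last cycle and adjust the consistent hash ring."""
    now_alive = [m.decode() for m in r.smembers("alive_crawlers")]
    last_alive = running.setdefault("last_alive", [])            # created on the first cycle
    increase = [n for n in now_alive if n not in last_alive]     # newly joined crawler nodes
    decrease = [n for n in last_alive if n not in now_alive]     # crawler nodes that died
    for node in increase:
        ring.add_node(node)
    for node in decrease:
        ring.remove_node(node)
    running["last_alive"] = now_alive
    return now_alive
```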
As shown in figure 6, the rule-scheduling thread does two things in each cycle: it decides whether each rule needs rescheduling, and it performs rule-queue scheduling, putting the seeds in each rule's seed list into the corresponding crawler queue and creating or updating the URL information object if it does not yet exist.
Concrete rule scheduling then executes as follows: the number of URLs that may be scheduled in this round is computed from the average rate, the user's rate parameter and the remaining capacity; it is compared with the actual number in the scheduling queue, the smaller value is taken as the number actually scheduled this time, and those URLs are pushed into the crawler queue in order. After scheduling finishes, the unused schedulable space is returned as remaining capacity and carried over to the next rule.
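One scheduling pass for a single rule might be sketched like this; the queue key layout and the way the per-round budget is passed in are assumptions.

```python
import redis

r = redis.Redis()

def schedule_rule(rule_name, crawler_queue, allowed, remaining_capacity):
    """allowed: URLs this rule may schedule this round; returns leftover capacity for the next rule."""
    budget = allowed + remaining_capacity
    pending = r.llen("rule_queue:" + rule_name)
    n = min(budget, pending)                  # never schedule more than the queue actually holds
    for _ in range(n):
        url = r.lpop("rule_queue:" + rule_name)
        if url is None:
            break
        r.rpush(crawler_queue, url)           # hand the URL to the chosen crawler's queue
    return budget - n                         # unused space carries over to the next rule
```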
An indirect control logic connects the rule-update thread and the rule-scheduling thread, using the Event object of the threading module for inter-thread communication: if the event flag is true, execution continues; if it is false, the thread blocks until the flag becomes true. So that the rule-scheduling thread can both sleep periodically and respond to the event flag set by the rule-update thread, it uses a fragmented-sleep scheme, shown in figure 7: the whole sleep period is divided into 100 slices that are slept in a loop, and the state of the event flag is checked before each slice; if the flag is true the sleep ends immediately, otherwise the loop continues. This scheme does not waste CPU on idle spinning and still responds to thread notifications immediately.
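The fragmented sleep is easy to express with threading.Event; the sketch below divides the period into 100 slices and checks the flag before every slice. It is an illustration, not the patent's code.

```python
import threading
import time

def fragmented_sleep(event: threading.Event, period: float, slices: int = 100):
    """Sleep `period` seconds in `slices` short naps, aborting as soon as the flag is set."""
    step = period / slices
    for _ in range(slices):
        if event.is_set():        # rule-update thread has published a new rule version
            event.clear()         # consume the notification (one possible design choice)
            return True           # end the dormancy immediately
        time.sleep(step)
    return False                  # slept the whole period without being notified
```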
(II) Implementation of the crawler
The crawler implements a multi-thread-pool module on top of the Redis queue model. Three fetch threads are created in a loop, and setDaemon is called before each thread starts so that it runs in daemon state; daemon threads have no independent right to survive, i.e. once only daemon threads remain, the process terminates.
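The daemon-thread start-up can be sketched as follows; `FetchThread` and the Redis queue name are assumed names, and the body of run only indicates where download, extraction and feedback would occur.

```python
import threading
import redis

class FetchThread(threading.Thread):
    """Assumed skeleton of a fetch thread: pull a task, download, extract, feed back."""
    def __init__(self, conn):
        super().__init__()
        self.conn = conn

    def run(self):
        while True:
            item = self.conn.blpop("crawler_queue", timeout=5)  # blocking pop from the Redis queue
            if item is None:
                continue                    # nothing to do yet, poll again
            _, url = item
            # download(url), extract and feed back links, enqueue the page for storage (see below)

threads = []
for _ in range(3):                          # three fetch threads, as described above
    t = FetchThread(redis.Redis())
    t.setDaemon(True)                       # daemon: the process exits once only daemon threads remain
    t.start()
    threads.append(t)
```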
Each thread class in the crawler comprises page downloading and preprocessing, URL extraction and normalization filtering, and page content feedback.
Page downloading and preprocessing: the url_break module is first used to parse the host address field of the URL and a cookie-processing object cookie_handler is wrapped; a proxy IP is then fetched at random from Redis and, if one exists, wrapped into a proxy-processing object; finally an opener object is assembled for sending requests. The constructed request-header information (headers) contains: the Referer field, which marks the page from which the link was requested and guards against hot-linking and malicious content-crawling robot programs; the Host field, which names the requested host and solves the identification of virtual hosts behind the same IP; and the User-Agent field, which gives the client identity and client system version information.
A Request object is constructed from the headers and the URL and opened through the opener to read the page file. If the page is obtained normally, the compression state is first read from the response headers and GZIP decompression is performed; the encoding is then unified: the chardet module detects the page encoding automatically and the page is converted to utf-8, with the GB2312 and GBK Chinese encodings decoded uniformly through GB18030. URL extraction and normalization filtering use a regular expression to extract URLs, normalize them according to preset rules, and finally filter the qualified URLs through the rule's regular expressions before feeding them back to the scheduler; a regular expression that extracts all URLs in the page is constructed for this purpose.
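A hedged Python sketch of this download-and-preprocessing path using urllib, gzip and chardet; the proxy-pool key, the header values and the encoding fallback are assumptions.

```python
import gzip
import io
import urllib.request
from urllib.parse import urlparse

import chardet
import redis

def fetch_page(url, parent_url=None, timeout=10):
    """Download a page through an opener with cookie and (optional) proxy handlers, return decoded text."""
    host = urlparse(url).netloc
    handlers = [urllib.request.HTTPCookieProcessor()]           # cookie_handler
    proxy = redis.Redis().srandmember("proxy_pool")             # random proxy IP, if any
    if proxy:
        handlers.append(urllib.request.ProxyHandler({"http": proxy.decode(),
                                                     "https": proxy.decode()}))
    opener = urllib.request.build_opener(*handlers)

    headers = {
        "Referer": parent_url or url,    # marks where the link was requested from
        "Host": host,                    # disambiguates virtual hosts behind one IP
        "User-Agent": "Mozilla/5.0",     # assumed browser identity string
    }
    request = urllib.request.Request(url, headers=headers)
    with opener.open(request, timeout=timeout) as resp:
        raw = resp.read()
        if resp.headers.get("Content-Encoding") == "gzip":      # GZIP-compressed page
            raw = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()

    detected = chardet.detect(raw)["encoding"] or "utf-8"
    if detected.lower() in ("gb2312", "gbk"):                   # unify Chinese encodings via GB18030
        detected = "gb18030"
    return raw.decode(detected, errors="replace")               # page text in a unified format
```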
Third, integrated test and result evaluation of the crawling robot
Following the design of the invention, the scheduler is started first to pre-load the rules, the proxy pool is started to obtain real-time proxy IPs, crawlers are started on several desktops to fetch in distributed parallel, and the overall running state of the system is monitored through an open-source Redis manager and the Web background monitor. Redis database A holds the system configuration file, the rule version number, all added rules, the survival state of the scheduler and the survival states of all crawlers. Redis database B stores, in a hash-table structure, every URL object the system has been asked to crawl; these objects are used for URL deduplication checks and for querying the current state and recording state transitions during rule scheduling. Redis database C holds the rule queue of each rule and the scheduling queue of each crawler, and Redis database D holds the data-storage buffer queue, where each queue item contains the original page information and the page URL.
Using the distributed deployment architecture designed by the invention, the crawl speed and the volume of collected data were measured with one, three and five crawler nodes respectively. The results are shown in figure 8 and plotted as the comparison curves in figure 9; the slopes of the plotted lines show that collection efficiency is proportional to the number of crawler nodes opened, and the advantage of multiple crawlers becomes more and more evident over time.

Claims (10)

1. A master-slave distributed content crawling robot for advertisement delivery, characterized in that it is designed as a Redis-based distributed content crawling robot and performs fetching and storage in a distributed network deployment; the overall architecture of the distributed content crawling robot comprises a hub scheduler, a plurality of crawler nodes, a Web configuration management module, a proxy pool module, a distributed storage module, a Redis database and a MongoDB database, and specifically:
firstly, based on the actual demands of advertisement delivery, a master-slave distributed content crawling robot is provided that indexes web page information and updates it periodically, builds training and test sets, and achieves quick response of advertisement delivery pages;
secondly, the content crawling robot consists of a hub scheduler and a plurality of crawlers; crawl-rule configuration and real-time monitoring of the running state are carried out via the Web, a Redis in-memory database provides two-way communication between the hub scheduler and the crawler nodes, the hub scheduler uses multi-level URL queues for URL rule matching and deduplication, periodically monitors changes in the crawl rules and reschedules rules accordingly, adjusts the scheduling rate of each rule queue according to the configuration, and balances tasks among the crawlers through a consistent hash algorithm; in addition, each crawler requests URLs concurrently with a multi-thread-pool design, first extracting page links, then extracting metadata and body content with the open-source Goose module, storing the content in a distributed manner through a shard and replica-set mechanism, and using proxy IPs to keep websites from blocking the content crawling robot.
2. The master-slave distributed content crawling robot for advertisement delivery according to claim 1, wherein the basic operating flow of the master-slave distributed content crawling robot is as follows: first the MongoDB distributed database and the Redis database are started and confirmed to run normally, then the Web configuration management module (Flask) is started to listen on local port 5000, and the fetch rules specified for particular sites are configured and saved to the Redis database in turn; the hub scheduler is then started: it first loads the system configuration file into the global scope and starts a rule-update thread, which reads the pre-established fetch rules from Redis into a global dictionary, updates the rule version information, creates and starts the scheduling thread, and periodically and dynamically monitors the rule version number, activating the scheduling flag to tell the scheduling thread to begin a new round of scheduling immediately whenever a rule change is found; in one scheduling round the scheduling thread carries out two processes, initial rule scheduling and balanced rule-queue scheduling, where initial rule scheduling is driven by the rule seed list and performs one update scheduling whenever the update period has been reached, and rule-queue scheduling determines the scheduling rate of each queue from the priority and weight in the configuration, removes duplicates, updates the scheduling state, and places the URLs into the scheduling queues of the currently surviving crawlers via a consistent hash procedure; once started, the proxy pool is responsible for periodically obtaining real-time proxy IP information from the Internet, checking its validity, and putting valid proxy IPs into Redis; once started, a crawler loads several thread units, loads URLs from its scheduling queue, hands them to the download component, extracts URLs after downloading and feeds them back to the scheduler, and puts pages into the data queue to await the storage module; finally the storage module is started, which preprocesses pages, extracts the useful information and stores it in the distributed MongoDB database, while the running state of the system is monitored through the Flask Web background; the components are independent of one another and can be deployed on different machine nodes, so resources are used effectively.
3. The master-slave distributed content crawling robot for advertisement delivery according to claim 1, wherein the design of the hub scheduler is as follows: the hub scheduling class consists of a rule-update thread class and a rule-scheduling thread class; the three classes work together and depend on an environment class, which provides the global running dictionary variable, stores the real-time fetch rules and some globally shared variables, and also provides the Redis database connection-pool variable; the working class defines two static methods:
the first method: the link-check method carries out link-state scheduling logic and performs the actual scheduling operation at the same time; for a link that has not appeared before it creates a new URL object; for an existing URL object it judges the current state: if the current state is crawl-failed, the numbers of scheduling attempts and scheduling failures are counted from the scheduling record table, the URL is abandoned if they exceed a set value, and otherwise it is given another chance to participate in scheduling once the scheduling time interval is satisfied; if the current state is not failed, scheduling is abandoned when the state is crawling or scheduled; if the state is crawl-succeeded and the URL type is branch, the URL is scheduled immediately, on the assumption that the content of branch nodes changes quickly; in all remaining cases it is scheduled once the time interval is satisfied;
the second method: updates the URL object information while performing the hash-based deduplication function.
4. The advertisement-delivery-oriented master-slave distributed content crawling robot according to claim 3, wherein the data items in the URL object include the creation time, rule name, type, the timestamp of the most recent scheduling and the scheduling record table; the scheduling state is the initial state after URL creation and marks a new round when scheduling starts; the URL changes to the scheduled state to mark that the URL object has entered the scheduling queue of some crawler and its scheduling timestamp is updated; the state changes to crawling when the URL object is taken out by a crawler and downloading starts; the state changes to crawl-succeeded if crawling succeeds, otherwise it changes to crawl-failed and the URL is recorded in the crawl-failure set for subsequent processing;
a private interface HashRing is designed using consistent hashing, guaranteeing the balance and dispersion of task allocation as well as the monotonicity of the system under extreme conditions; the rule-scheduling thread inherits this interface to perform the task; before each scheduling round the currently surviving crawler nodes in the system are obtained from the Redis database and added to a survival list, which is then compared with the previous survival list to produce two vectors: an increase vector containing the nodes that have just joined and a decrease vector containing the nodes that have just died; the hash-ring state is updated through the interface operations, each rule is added to the list of the corresponding crawler node according to its domain name, and finally the URLs in each rule queue are placed into the crawler queues at the specified scheduling rate by traversing the crawler node list;
the rule scheduling rate is determined by the global weight, the scheduling-queue size limit, the number of rules in a crawler, and the weight and priority parameters specified by each rule; the rule weight and priority are directly proportional to the scheduling rate, the priority ranges from 1 to 10, and the smaller the value the higher the priority;
during formal scheduling, the rules belonging to the same crawler compute the total scheduling rate by summation, and the schedulable space is then computed from the difference between the scheduling-queue size limit and the current length of the scheduling queue; each rule determines, as a percentage, the share of the available space it may use; if the number of URLs in the current rule queue exceeds the available space for this round, only that many URLs are scheduled, and if it is smaller, the remaining space is accumulated onto the next rule until all rules have been scheduled.
5. The advertisement-delivery-oriented master-slave distributed content crawling robot according to claim 4, wherein, considering that the rule-update thread indirectly affects the scheduling frequency of the rule-scheduling thread, a fragmented loop-detection method is provided to ensure that a sleeping rule-scheduling thread can still respond to a hot update of the rules in time: an event flag is shared between the two threads; normally the rule-scheduling thread goes to sleep after each scheduling round and the rule-update thread goes to sleep after each check of the rule version number, but with the fragmented-sleep design the rule-scheduling thread divides its sleep time into small slices and, after each short nap, checks whether the event flag has changed to decide whether to continue with the next nap; when the rule-update thread finds that the rule version has changed, it sets the event flag before sleeping, notifying the rule-scheduling thread to carry out the next scheduling round immediately.
6. The master-slave distributed content crawling robot for advertisement delivery according to claim 1, wherein the crawler and storage module are designed as follows: the crawler comprises a downloader component, a URL extraction and feedback component and a DNS resolution component; the three components execute in strict sequence and are packaged into a thread unit, the DNS resolution component is shared among threads and designed as a reentrant shared function, and the crawler completes concurrent fetching and parsing by means of a thread pool formed by the cooperation of several such thread units; each time a fetch task is obtained, the DNS resolution segment first consults the local domain-name resolution cache, on a cache hit the flow enters the downloader component directly, otherwise a DNS query is issued and the response is awaited; the result of the download segment is handed to the URL extraction and feedback stream, the extracted URLs are fed back to the scheduler, and the request result is forwarded to the storage module, completing one full execution flow;
the execution flow of the download component is as follows: a fetch target is taken from the task queue, request-header information is constructed to imitate the real behaviour of a browser as closely as possible, a proxy IP and other components are encapsulated, an HTTP request is sent, the server response is awaited within a limited time, and the response is decompressed and decoded before page information is returned in a unified format;
the extraction and feedback component first extracts all link information in a page with a regular expression and passes it to a URL normalization method for filtering: links that are not URLs are discarded, the remainder are uniformly converted to lowercase, URL-encoded characters are restored, equivalent directory structures are replaced and default ports are removed; finally each URL is decomposed into scheme, host address, directory path and request parameters and reassembled into a complete URL; the normalized URL is forwarded to the rule filtering method, which matches it against the branch-type and node-type URL pattern lists of the rule to determine the type of the new URL, branch-type patterns having matching priority over node-type patterns; if matching succeeds, the URL, its type and its parent URL form a triple that is fed back to the scheduler.
7. The master-slave distributed content crawling robot for advertisement delivery according to claim 6, wherein the storage module adopts a multi-thread-pool design to reduce the influence of database IO on system performance and performs two tasks: web page content extraction and distributed data storage; the open-source python-Goose plug-in is used together with XPath to extract the title, keywords, description and body text, and a Chinese word segmentation module is additionally loaded for analysing Chinese content;
the large data blocks are divided into a plurality of small data blocks by adopting horizontal expansion to be stored on a plurality of independent devices or shards in a distributed mode, the horizontal expansion adopts a distributed storage method based on a shard cluster mechanism, the shard cluster consists of a configuration server, specific shards and a distribution route, the process of shards for actually storing data information provides consistent data access to the outside, and a data backup mechanism with a master-slave synchronization structure is adopted.
8. The master-slave distributed content crawling robot for advertisement delivery according to claim 1, wherein the proxy module is designed as follows: the proxy module provides real-time available proxy IPs to the crawlers so that the high crawl rate and volume of many crawl nodes inside the local area network do not get the IP blocked; it periodically fetches configured proxy websites to obtain basic IP address and port information, then checks the validity and speed of each proxy with a set of pre-assigned verification information, and adds proxies that meet the requirements to the Redis proxy pool;
the complete proxy procedure is: the client first connects to the designated proxy server, performs the protocol-specific parameter handling required by the proxy protocol in use, and then asks the proxy to establish a connection to the target server to obtain the data; the proxy server also acts as a cache, storing the file data requested by the client locally so that on the next request it no longer contacts the target server but returns the result directly, greatly improving response speed and reducing network traffic.
9. The advertisement-delivery-oriented master-slave distributed content crawling robot according to claim 1, characterized by the implementation of the hub scheduler: the scheduler is a Redis-based multithreaded URL control and management module covering URL deduplication, rate scheduling and balanced URL distribution, and includes a function for obtaining the top-level domain of a URL (get_top_domain), a URL state update function (update_link_state) and a URL scheduling check function (check_link);
the scheduler obtains the top-level domain: given any URL as parameter it returns the URL's top-level domain; a dictionary of top-level-domain suffixes is first defined as the matching standard and a regular expression regx is defined that matches strings consisting of any number of dot-separated segments and ending with any dictionary entry; the expression is pre-compiled in case-insensitive mode, the host field of the URL is parsed with the urlparse module and then matched, and on success the captured group value is returned, otherwise a null value;
the scheduler creates and updates URL state information: the inputs are a LINK object and the status string returned by the crawler, where the LINK object contains the URL value, the parent URL value and the URL type; the MD5 digest of the URL is computed first and the URL is looked up in the Redis deduplication hash table; if it is not found, a new URL information object is created containing the creation time, trace information, time of the last state change, time of the last change to the scheduled state, current state, state-change list and URL type, serialized with JSON and added to the Redis deduplication hash table; if an information object for the URL is found, its current state value and state-change list are modified and written back to the Redis deduplication hash table;
scheduler URL scheduling check: the inputs are likewise a LINK object returned by the crawler and the rule dictionary xspider it belongs to; the URL information object is fetched from the deduplication hash table and its current state is examined; if the state is failed, the number of failure states in the state-change list is counted and compared with a preset retry value to decide whether to give the URL another chance or abandon it; if the state is not failed, a URL that is scheduled or crawling is skipped, a URL whose state is completed and whose type is branch has its scheduling interval set to 0 so that it participates in scheduling immediately, and in all remaining cases a scheduling opportunity is granted only if the scheduling interval has elapsed; balanced distribution of URLs among crawler nodes is achieved by assigning domain names through a consistent hash ring;
after the global functions come the rule-update thread and the rule-scheduling thread; consistent-hash distribution in the rule-update thread is implemented as follows: a hash-ring object is defined outside the loop; in each cycle three lists are defined, the surviving-node list (now_alive), the increase vector (increase) and the decrease vector (decrease); on the first cycle a last-survivors list (last_alive) is created in the running dictionary; at the start of each cycle the currently surviving nodes are read from Redis into now_alive and compared with last_alive in the running dictionary to form the increase and decrease vectors.
10. The master-slave distributed content crawling robot for advertisement delivery according to claim 1, characterized by the implementation of the crawler: the crawler implements a multi-thread-pool module on top of the Redis queue model; three fetch threads are created in a loop and setDaemon is called before each thread starts so that it runs in daemon state; daemon threads have no independent right to survive, i.e. once only daemon threads remain, the process terminates;
each thread class in the crawler comprises page downloading and preprocessing, URL extraction and normalization filtering, and page content feedback;
page downloading and preprocessing: the url_break module is first used to parse the host address field of the URL and a cookie-processing object cookie_handler is wrapped; a proxy IP is then fetched at random from Redis and, if one exists, wrapped into a proxy-processing object; finally an opener object is assembled for sending requests; the constructed request-header information (headers) contains: the Referer field, which marks the page from which the link was requested and guards against hot-linking and malicious content-crawling robot programs; the Host field, which names the requested host and solves the identification of virtual hosts behind the same IP; and the User-Agent field, which gives the client identity and client system version information;
a Request object is constructed from the headers and the URL and opened through the opener to read the page file; if the page is obtained normally, the compression state is first read from the response headers and GZIP decompression is performed, then the encoding is unified: the chardet module detects the page encoding automatically and converts the page to utf-8, with the GB2312 and GBK Chinese encodings decoded uniformly through GB18030; URL extraction and normalization filtering use a regular expression to extract URLs, normalize them according to preset rules, and finally filter the qualified URLs through the rule's regular expressions before feeding them back to the scheduler, a regular expression that extracts all URLs in the page being constructed for this purpose.