CN105279272A

CN105279272A - Content aggregation method based on distributed web crawlers

Info

Publication number: CN105279272A
Application number: CN201510724024.6A
Authority: CN
Inventors: 黄韬; 魏亮; 魏静波; 邓晓涛; 周洪利
Original assignee: NANJING FUTURE NETWORKS INDUSTRY INNOVATION Co Ltd
Current assignee: NANJING FUTURE NETWORKS INDUSTRY INNOVATION Co Ltd
Priority date: 2015-10-30
Filing date: 2015-10-30
Publication date: 2016-01-27

Abstract

The invention provides a content aggregation method based on distributed web crawlers. First, different crawler platforms are deployed on different devices and a request is sent to the network information source to be crawled; the crawler platforms formulate crawling rules according to the target information required by a user and crawl the information in which the target user is interested. The crawled network information is then processed, and similarity detection is carried out based on data transmission and conversion methods in a real-time database combined with locality-sensitive hashing (LSH), so as to reduce the redundancy of the information. Finally, the system classifies and sorts the information by category, popularity, and keywords and displays it on user equipment. The method applies LSH and similarity comparison to data obtained from a real network to produce a comparison result. Compared with the result obtained in the prior art by the traditional approach of checking whole records for duplicates, the method computes faster and compares similarity more accurately.

Description

A content aggregation method based on distributed web crawlers

Technical field

The present invention relates to the technical field of web crawlers, and in particular to a content aggregation method based on distributed web crawlers.

Background art

With the development of the Internet, the era of big data has arrived, and the value of massive data will be realized more fully. Because Internet information such as massive streaming video resources and rich web page content keeps growing, it is difficult for a specific user to obtain the network data they need accurately and effectively through a handheld device within limited fragments of time. Existing content aggregation techniques mostly rely on simulations based on an upper-layer architecture to demonstrate the advantages of their content aggregation systems, and lack applications implemented for real network environments and for the specific information needed by potential user groups.

The filter conditions chosen by traditional content aggregation methods are too broad, and information cannot be obtained in a customized, large-batch way, so it is difficult to guarantee the timeliness of the information and its relevance to the topic. These methods cannot adapt to rapid, irregular changes at the information source end, which makes information sources short-lived and prevents long-term information acquisition. Identical information from multiple different sources on the Internet cannot be distinguished, which causes duplication and redundancy and reduces the efficiency with which target users obtain information. Therefore, how to improve the persistence of information acquisition, the removal of redundant information, and the aggregation and classification of information on a content aggregation platform is worth studying.

Summary of the invention

The present invention provides a content aggregation method based on distributed web crawlers, with the aim of solving the problem that web crawler technology in the prior art cannot effectively aggregate and classify customized, large-batch network information.

The content aggregation method based on distributed web crawlers provided by the invention comprises the following process:

Step one, capturing target information: first, different crawler platforms are deployed on different devices, and a request is sent to the network information source to be crawled; the crawler platforms formulate crawling rules according to the target information required by the user and capture the information in which the target user is interested;

Step two, detecting the similarity of crawled content: the crawled network information is processed, and similarity detection is carried out based on data transmission and conversion methods in a real-time database combined with locality-sensitive hashing (LSH), thereby reducing the redundancy of the information;

Step three, aggregating and classifying the crawled information: on the basis of step two, the system classifies and sorts the filtered information by category, popularity, and keywords, and displays it on user equipment.

In step one, setting up the crawler platform further comprises the following steps:

Before a task starts, the crawler platform is deployed and the crawler attributes are configured. Links irrelevant to the user's search are filtered out by a web page analysis algorithm, and the retained links are placed in a queue of URLs to be crawled. During filtering, the background server first converts the web page content to text form and, using a text-based web page analysis algorithm, selects from the queue the next URL to crawl. The above steps are repeated, traversing all pages, until the program's stop condition is met.
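The filter-and-queue loop described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the in-memory `PAGES` table, the `fetch` stand-in, the keyword test in `is_relevant`, and the `max_pages` stop condition are all assumptions made for the example. For simplicity it judges relevance from the text of each fetched page and does not follow links from pages it rejects.

```python
from collections import deque
import re

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "http://example.test/": ("portal home news video",
                             ["http://example.test/news", "http://example.test/ads"]),
    "http://example.test/news": ("latest news headlines",
                                 ["http://example.test/news/1", "http://example.test/"]),
    "http://example.test/news/1": ("news story about crawlers", []),
    "http://example.test/ads": ("buy cheap things now", ["http://example.test/ads/1"]),
    "http://example.test/ads/1": ("more cheap things", []),
}


def fetch(url):
    """Stand-in for an HTTP download; returns (text, links)."""
    return PAGES.get(url, ("", []))


def is_relevant(text, keywords):
    """Text-based page analysis reduced to a keyword test (an assumption)."""
    words = set(re.findall(r"\w+", text.lower()))
    return bool(words & keywords)


def crawl(root, keywords, max_pages=100):
    """Breadth-first crawl that keeps only pages matching the keywords."""
    queue = deque([root])          # queue of URLs waiting to be crawled
    seen = {root}
    kept = []
    while queue and len(kept) < max_pages:   # stop condition
        url = queue.popleft()
        text, links = fetch(url)
        if not is_relevant(text, keywords):
            continue               # drop pages unrelated to the user's search
        kept.append(url)
        for link in links:
            if link not in seen:   # enqueue each retained link once
                seen.add(link)
                queue.append(link)
    return kept


result = crawl("http://example.test/", {"news"})
```

Here the advertising branch is never expanded because its first page fails the relevance test, which keeps the crawl scope bounded.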

Deploying the crawler comprises the crawler's service configuration and task configuration.

Step one specifically comprises:

Step 1.1: the root URLs are divided into several major classes according to their business category, and the major class corresponding to the target information is selected for crawling;

Step 1.2: the crawl target addresses are configured according to the major-class URLs corresponding to the target information; each page is entered to obtain its detailed tags, and the specific content is crawled.

Step one further comprises step 1.3: when proceeding to step 1.2, if the target address page classifies the information in more detail, each group page is entered to obtain its detailed tags and the specific content is crawled; step 1.3 is repeated until the specific content has been crawled.
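Steps 1.1 to 1.3 amount to descending through ever finer categories until a page with specific content is reached. A small sketch under stated assumptions: the `SITE` dictionary and its `children`/`content` keys are hypothetical stand-ins for real category and detail pages.

```python
# Hypothetical site map: a category page lists sub-pages; a leaf page has content.
SITE = {
    "video": {"children": ["video/film", "video/tv"]},
    "video/film": {"children": ["video/film/a"]},
    "video/tv": {"content": "TV series listing"},
    "video/film/a": {"content": "Film A details"},
}


def crawl_category(url, site):
    """Descend through finer categories until specific content is reached."""
    page = site[url]
    if "content" in page:                      # step 1.2: detail page, crawl it
        return [(url, page["content"])]
    items = []
    for child in page.get("children", []):     # step 1.3: finer classification
        items.extend(crawl_category(child, site))
    return items


items = crawl_category("video", SITE)
```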

Step two specifically comprises:

Step 2.1: the crawled network information is processed: blank strings and multimedia elements in the information are replaced, and the pictures and video resources contained in the information are extracted and replaced with corresponding text;

Step 2.2: any substring of length k in the text is defined as a k-shingle, so each piece of information can be represented as the set of k-shingles that occur one or more times in its text; this set needs to be replaced by a much smaller set represented by a small-scale signature, and the similarity of the actual sets is estimated by comparing the signature sets of the information;

Step 2.3: locality-sensitive hashing is applied to the information multiple times, so that similar items are more likely to be hashed to the same bucket than dissimilar items; any pair of items hashed to the same bucket at least once is treated as a candidate pair, and similarity detection is carried out only on these candidate pairs; information whose similarity reaches the set threshold is filtered out and deleted, reducing the redundancy of the information.
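To make steps 2.1 and 2.2 concrete, the sketch below normalizes a fragment of HTML to plain text, builds its k-shingle set, and computes the exact Jaccard similarity that the signatures are meant to estimate. The placeholder strings `[image]` and `[video]`, the regular expressions, and the default `k=5` are assumptions chosen for illustration; the source does not specify them.

```python
import re


def normalize(html):
    """Replace media tags with text placeholders and collapse whitespace."""
    text = re.sub(r"<img\b[^>]*>", " [image] ", html, flags=re.I)
    text = re.sub(r"<video\b[^>]*>.*?</video>", " [video] ", text,
                  flags=re.I | re.S)
    text = re.sub(r"<[^>]+>", " ", text)        # drop the remaining tags
    return re.sub(r"\s+", " ", text).strip()    # collapse blank runs


def shingles(text, k=5):
    """Set of all length-k substrings (k-shingles) of the text."""
    if len(text) < k:
        return {text} if text else set()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


clean = normalize('<p>Hello   <img src="x.png"> world</p>')
sim = jaccard(shingles("abc", 2), shingles("bcd", 2))
```

Computing `jaccard` for every pair of items is what becomes too expensive at scale, which is why the method replaces the sets with short signatures and compares only candidate pairs.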

Compared with the prior art, the invention, by adopting the above technical solution, has the following technical effects:

The method provided by the embodiments of the invention performs similarity comparison after the required information has been obtained, removes redundant information, and yields information data with lower redundancy. The method applies LSH to data obtained from a real network and carries out similarity comparison to obtain a comparison result; compared with the result obtained in the prior art by the traditional approach of checking whole records for duplicates, its computation is faster and its similarity comparison more accurate.

Brief description of the drawings

The invention is further described below with reference to the accompanying drawings:

Fig. 1 is a flowchart of the crawling rules of the web crawler provided by the invention;

Fig. 2 is a flowchart of the similarity detection process for crawled content provided by the invention;

Fig. 3 is a flowchart of the aggregation and classification process for crawled information provided by the invention;

Fig. 4 is a flowchart of the distributed crawler deployment method provided by the invention;

Fig. 5 is a schematic diagram of the distributed crawler deployment system architecture provided by the invention;

Fig. 6 is a structural diagram of the content aggregation system based on distributed web crawlers provided by the invention.

Detailed description of the embodiments

The invention provides a content aggregation method based on distributed web crawlers. To make the purpose, technical solution, and effects of the invention clearer, the invention is described in more detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

Fig. 6 shows the structure of the content aggregation system provided by the invention. The system comprises:

User interface: the user manages and schedules system tasks through a graphical user interface. Each crawler node is responsible for the scheduling service, which mainly provides crawler task start, task stop, and task status services. The graphical user interface is the visual operating interface that the content aggregation platform offers the user, and serves as the crawler task management platform;

By calling the underlying service interfaces, this interface acts as a central platform for managing the attributes, status, and logs of crawler node tasks, providing system administrators with an easy-to-use, intuitive control platform.

Content aggregation and classification module: this module provides commands related to crawler tasks and controls the specific state of each crawler task. According to the specific customization requirements, it sets the detailed crawling scope and dynamically tracks the crawling state, and it provides data write and update services for crawler tasks so that the specific crawled content can be obtained. After the information acquisition instruction module has obtained the required data, the captured data is read from the database and becomes the data to be processed.

Information similarity detection module: after the required information has been obtained, similarity comparison is carried out and redundant information is removed, yielding information data with lower redundancy. This module performs duplicate checking on the information from the underlying crawler nodes, reducing the information redundancy in the content aggregation system.

Information preprocessing module: after the information similarity detection described above, the web page content is analyzed in depth. A text-based web page analysis algorithm extracts the page's text, and the aggregation platform automatically obtains the corresponding information from the text content, such as the title and body, and uses it to fill the corresponding blank template, so that the content is displayed on the handheld terminal via the back end.

Based on the above system, the content aggregation method based on distributed web crawlers provided by the invention comprises the following process:

Step two, detecting the similarity of crawled content: the crawled network information is processed, and similarity detection is carried out based on improved database data transmission combined with locality-sensitive hashing (LSH), thereby reducing the redundancy of the information;

Before the task begins, the deployment and configuration of the crawlers are introduced first. As shown in Fig. 4, the distributed crawler deployment proposed by the invention specifically comprises:

Distributed crawler configuration comprises crawler service configuration and crawler task configuration. Crawler service configuration ensures that the resources on which the service depends can be obtained correctly, supporting normal task operation. Crawler task configuration sets the task attributes, such as the page download interval, the number of task threads, and the task execution frequency.

Distributed crawler deployment: specifically, the crawler is deployed mainly as a web service. A Tomcat container must be installed, and the crawler is finally published as a WAR package; once each WAR is deployed, it can provide service. In actual deployment, one physical device hosts one crawler node; for convenience of testing and to make full use of resources, multiple containers can be deployed on one physical server, each assigned a different port to provide service. An IP address and port together uniquely identify a crawler node.
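The configuration and node identity described above could be modelled as in the following sketch. It is an illustration only: the class names `NodeId`, `TaskConfig`, and `Registry`, the field names, and the default values are all hypothetical, and nothing here reflects Tomcat or any real deployment API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NodeId:
    """A crawler node is identified by its IP address and port."""
    ip: str
    port: int


@dataclass
class TaskConfig:
    """Task attributes named in the text; the defaults are assumptions."""
    download_interval_s: float = 1.0   # pause between page downloads
    threads: int = 4                   # number of task threads
    runs_per_day: int = 24             # task execution frequency


@dataclass
class Registry:
    """Hypothetical registry kept by the central management platform."""
    nodes: dict = field(default_factory=dict)

    def register(self, ip, port, config):
        node = NodeId(ip, port)
        if node in self.nodes:
            raise ValueError(f"node {ip}:{port} already registered")
        self.nodes[node] = config
        return node


registry = Registry()
# Two containers on one physical server, distinguished only by port.
first = registry.register("10.0.0.5", 8080, TaskConfig())
second = registry.register("10.0.0.5", 8081, TaskConfig(threads=8))
```

Because `NodeId` is frozen, it is hashable and can serve directly as the dictionary key, which mirrors the statement that IP address plus port uniquely identifies a node.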

Centralized task management platform: this mainly covers the task scheduling mode and is the crawler aggregation management system described above. It connects the crawler edge nodes, schedules system tasks, controls the state of each node, and manages and controls tasks.

More specifically, the distributed crawler deployment system architecture is shown in Fig. 5:

The diagram presents the distributed crawler deployment architecture, in which the centralized task management platform is the content aggregation management platform described above. It connects the crawler edge nodes, schedules tasks for the crawler system, and controls the state of each node; an administrator can select the corresponding working nodes to manage and control tasks.

The crawler edge nodes represent crawler systems deployed on different devices, reflecting the distributed deployment. Different devices represent different crawling tasks, and the task assignments on the nodes may also overlap. Because the tasks on the devices are independent and do not necessarily depend on one another, limited resources are fully used and task execution is faster.

When the task begins, the first step, capturing target information, follows the flowchart in Fig. 1. The specific flow is as follows:

Step 101: the crawler sends a request to the network information source site to be crawled;

Step 102: crawling rules are configured according to the different business categories of each source site. The rules are based on the webmagic framework. Numerous network information source sites already exist on the Internet; these are the crawl root addresses (initial URLs). Taking the URL of the web page to be crawled as the root URL, and based on the webmagic framework, the various kinds of useful information are obtained from the source site according to the page source code, and the relevant configuration is carried out. For pictures, for example, the web page format is analyzed in detail and the picture tags are sorted and marked; in subsequent steps the pictures are displayed on the content aggregation platform page in the marked order. For video, the video redirect URL and the final playback page are obtained according to the web page format and inserted into the content aggregation platform page for final display. The crawling rules can be adjusted and adapted dynamically as the source site changes; if a crawling rule is modified, the crawler crawls according to the updated rule the next time it obtains information.

For example, when a client retrieves information of interest such as the latest news and hot topics, this is grouped into the news class. The crawl target addresses are first configured according to each news URL, and then each news page is entered to obtain its detailed tags and crawl the specific content. For the video and audio class, the crawler first enters the home page URL of each video site and the corresponding video categories, here specifically the three major categories of film, TV series, and variety shows, obtains the detailed playback address of each page, and then obtains the information for each video from the playback page.

Root addresses can be classified into different categories according to their business, such as news, video, and software. On this basis, corresponding crawling rules can be configured for different source sites; the crawling rules are controllable, which improves the validity and readability of the information. Existing web crawler technology extracts URLs from the current page and puts them in a queue until the program's stop condition is met. However, source site information changes constantly, and this technique has difficulty obtaining a source site's information accurately over the long term. The invention addresses this by configuring crawling rules dynamically and crawling adaptively. Take the news class, which includes the latest news and hot news: when a task is started, the crawl items are configured (this relates to the distributed deployment and task management of the fourth step); links irrelevant to the news (news site source URLs) are filtered out by a web page analysis algorithm based on page content (this relates to the second step); the retained links (so that the crawl scope is controlled) are placed in the queue to be crawled; the next URL to crawl (the URL of the final page) is selected from the queue according to some search strategy; and the above steps are repeated until the program's stop condition is met.
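The idea of rules that can be changed while the crawler keeps running can be illustrated as below. This is a plain Python sketch and does not use or imitate the webmagic API; the `RULES` table, the site name `news.example.test`, and the regular-expression rules are hypothetical.

```python
import re

# Hypothetical rule table: one entry per source site, replaceable at run time.
RULES = {
    "news.example.test": {"category": "news", "title": r"<h1>(.*?)</h1>"},
}


def update_rule(site, **changes):
    """Change a site's rule; the next crawl run picks up the new version."""
    RULES.setdefault(site, {}).update(changes)


def extract_title(site, html):
    """Look the rule up at call time, so edits apply without redeploying."""
    rule = RULES[site]
    match = re.search(rule["title"], html, flags=re.S)
    return match.group(1).strip() if match else None


# The source site changed its layout, so the old rule no longer matches.
new_page = '<div class="headline">New layout</div>'
before = extract_title("news.example.test", new_page)

# After the rule is updated, the next extraction succeeds.
update_rule("news.example.test", title=r'<div class="headline">(.*?)</div>')
after = extract_title("news.example.test", new_page)
```

Keeping the rule lookup inside `extract_title` rather than binding it once at start-up is the design choice that lets a modified rule take effect on the next crawl.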

This process adapts dynamically to changes at the information source end and adaptively obtains the information of interest to the target group. Compared with the uncustomized, large-batch, short-lived crawling methods of the prior art, the information it crawls is more useful and its information sources remain valid longer.

The similarity detection process for crawled content is shown in Fig. 2. The specific flow is as follows:

Step 201: the crawled network information is processed: blank strings and multimedia elements in the information are replaced, the pictures and video resources contained in the information are extracted and replaced with corresponding text, and the multi-element network information is finally stored as text;

Step 202: a value of k is chosen, the k-shingle set of each piece of information is constructed, and the k-shingles are mapped to shorter bucket numbers;

Step 203: the length n of the min-hash signature is chosen, and the min-hash signature of each piece of information is computed;

Step 204: a threshold t is set to define how similar two items must be to be regarded as a similar pair. The number of bands b and the number of rows r in each band are chosen so that br = n and the threshold t is approximately equal to (1/b)^(1/r). To avoid false negatives, b and r should be chosen to produce a threshold lower than t, while the computation speed of the similarity comparison must also be taken into account. Once suitable b and r have been chosen, LSH is used to build the candidate pairs; the signatures of each candidate pair are checked to determine whether the fraction of components in which they agree is greater than t, and if it is, one of the two items is deleted to avoid duplicate information.
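Steps 202 to 204 can be sketched end to end as follows. This is a minimal illustration under stated assumptions rather than the patented implementation: `zlib.crc32` stands in for the mapping of shingles to bucket numbers, the hash family `(a*x + c) mod PRIME` and the seed are illustrative choices, and the defaults `b=20`, `r=5`, `t=0.5` are picked only so that the banding threshold (1/b)^(1/r), about 0.55, sits near t.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

PRIME = 4294967311  # a prime just above 2**32, larger than any crc32 value


def shingle_ids(text, k=5):
    """Step 202: map each k-shingle to a 32-bit bucket number."""
    return {zlib.crc32(text[i:i + k].encode("utf-8"))
            for i in range(max(len(text) - k + 1, 1))}


def make_hashes(n, seed=0):
    """n random hash functions x -> (a*x + c) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def minhash(ids, hashes):
    """Step 203: signature of length n, the minimum of each hash over the set."""
    return [min((a * x + c) % PRIME for x in ids) for a, c in hashes]


def lsh_threshold(b, r):
    """Approximate similarity at which a pair becomes likely to be a candidate."""
    return (1 / b) ** (1 / r)


def candidate_pairs(signatures, b, r):
    """Step 204: items sharing a bucket in at least one band are candidates."""
    pairs = set()
    for band in range(b):
        buckets = defaultdict(list)
        for key, sig in signatures.items():
            buckets[tuple(sig[band * r:(band + 1) * r])].append(key)
        for keys in buckets.values():
            pairs.update(combinations(sorted(keys), 2))
    return pairs


def agreement(sig_a, sig_b):
    """Fraction of signature positions in which two items agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, b=20, r=5, t=0.5, k=5):
    """Keep one item from every candidate pair whose agreement exceeds t."""
    hashes = make_hashes(b * r)
    sigs = {key: minhash(shingle_ids(text, k), hashes)
            for key, text in docs.items()}
    dropped = set()
    for x, y in sorted(candidate_pairs(sigs, b, r)):
        if x not in dropped and y not in dropped and agreement(sigs[x], sigs[y]) > t:
            dropped.add(y)
    return {key: text for key, text in docs.items() if key not in dropped}


docs = {
    "a": "distributed web crawlers aggregate content from many sites",
    "b": "distributed web crawlers aggregate content from many sites",
}
unique = deduplicate(docs)
```

Only pairs that collide in some band are ever compared, so the cost depends on the number of candidate pairs rather than on all pairs of items. Raising b (with n fixed) lowers the threshold and admits more candidates, trading speed for fewer false negatives.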

The method provided by the embodiments of the invention performs similarity comparison after the required information has been obtained, removes redundant information, and yields information data with lower redundancy. The method applies LSH to data obtained from a real network and carries out similarity comparison to obtain a comparison result; compared with the result obtained in the prior art by the traditional approach of checking whole records for duplicates, its computation is faster and its similarity comparison more accurate.

The aggregation and classification process for crawled information is shown in Fig. 3. The specific flow comprises:

Step 301: the aggregation platform provides commands related to crawler tasks and controls the specific state of each crawler task. According to the specific customization requirements, it sets the detailed crawling scope and dynamically tracks the crawling state.

Specifically, the crawler information aggregation management platform depends on the physical connections and message communication among the crawler nodes in the network, and on communication with the database. The management platform and the crawler nodes are implemented as HTTP services, and the port is the one opened by the container. The crawler nodes write information such as their state to the database, and the management platform reads data from the database; both therefore need to connect to and communicate with the database.

Crawler information aggregation management platform: manages the crawlers and controls crawler state;

Node: each distributed deployment location of a crawler;

Database: stores the crawled information;

Interface: displays the crawled information.

Step 302: the data obtained in step 301 is preprocessed. According to the customized items, the aggregation platform obtains the first round of data from the database. Based on the crawl time and the crawler task status, the crawled content is given a rough page display, providing data for the aggregation management page.

Step 303: the aggregated data becomes the final data to be processed. Aggregation here refers to the form of the data after comprehensive processing, which is a routine kind of data processing: after the steps above, the crawled data is integrated into a form that can be displayed centrally.

Step 304: the data to be processed obtained in step 303 is presented to the user visually, and the user can also customize the required information through the page options.

Finally, the result display module has two display modes, PC and handheld client. On the PC, the user enters the corresponding access address, logs in, and performs the customization; the customization requirements are passed on through step 301, the crawler state is updated, a database connection is made, and the data returned by step 303 is displayed. Conventional web content analysis cannot display information in a regular way. By analyzing web page content in depth, and based on the diversity of the information and the variability of topics that the data exhibits, the obtained content is integrated to provide functions such as deep parallel retrieval of source web page information and keyword search within a category; related content is grouped according to latent semantics, and the data can be sorted accurately by timeliness. The handheld device works on the same principle as the PC. Result display also includes a page sorting function: the user can choose popularity or update time as the order in which the customized content is displayed.

The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
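The grouping and sorting offered by the result display could look like the sketch below. The `Item` fields, the numeric `heat` score, and the function names are hypothetical; the source states only that content can be ordered by popularity or by update time.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Item:
    title: str
    category: str
    heat: int              # hypothetical popularity score
    updated: datetime      # last update time of the item


def group_by_category(items):
    """Group crawled items under their category, keeping input order."""
    groups = {}
    for item in items:
        groups.setdefault(item.category, []).append(item)
    return groups


def sort_items(items, by="heat"):
    """Order items for display, highest heat or most recent first."""
    if by == "heat":
        return sorted(items, key=lambda item: item.heat, reverse=True)
    if by == "updated":
        return sorted(items, key=lambda item: item.updated, reverse=True)
    raise ValueError(f"unknown sort key: {by}")


items = [
    Item("Budget vote", "news", heat=40, updated=datetime(2015, 10, 30, 9, 0)),
    Item("Storm warning", "news", heat=95, updated=datetime(2015, 10, 29, 18, 0)),
    Item("New film trailer", "video", heat=70, updated=datetime(2015, 10, 30, 12, 0)),
]
by_heat = [item.title for item in sort_items(items, by="heat")]
by_time = [item.title for item in sort_items(items, by="updated")]
groups = group_by_category(items)
```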

Claims

1. A content aggregation method based on distributed web crawlers, characterized in that the method comprises the following process:

2. The content aggregation method based on distributed web crawlers according to claim 1, characterized in that,

3. The content aggregation method based on distributed web crawlers according to claim 1, characterized in that deploying the crawler comprises the crawler's service configuration and task configuration.

4. The content aggregation method based on distributed web crawlers according to claim 1, characterized in that,

Step one specifically comprises:

5. The content aggregation method based on distributed web crawlers according to claim 3, characterized in that step one further comprises step 1.3: when proceeding to step 1.2, if the target address page classifies the information in more detail, each group page is entered to obtain its detailed tags and the specific content is crawled; step 1.3 is repeated until the specific content has been crawled.

6. The content aggregation method based on distributed web crawlers according to claim 3, characterized in that,

Step two specifically comprises: