CN106126705A

CN106126705A - A real-time large-scale network data crawling system

Info

Publication number: CN106126705A
Application number: CN201610507120.XA
Authority: CN
Inventors: 刘丽君; 李成华
Original assignee: WUHAN TIPDM INTELLIGENT TECHNOLOGY Co Ltd
Current assignee: WUHAN TIPDM INTELLIGENT TECHNOLOGY Co Ltd
Priority date: 2016-07-01
Filing date: 2016-07-01
Publication date: 2016-11-16

Abstract

A real-time large-scale network data crawling system, comprising: an initial seed optimization module, for entering the seed links of websites and periodically adding qualifying web page links to the seed set, which serves as the initial seed set; an integration module, for obtaining HTML web documents and annotating the information in their text; a web page hyperlink information integration module, for storing the descriptive data of the hyperlinks of web pages; a web page relevance calculation module, for analyzing topic relevance with concrete numerical values, so that the degree of relevance is expressed as a number; and a hyperlink importance calculation module, which uses the calculated numerical information as a basis for judging relevance, likewise by quantitative analysis with concrete values. If the number of links contained in the current page reaches a certain value, the page is taken to contain a number of links; if that number reaches a preset value, the topic resources the page contains are taken to meet the preset requirement.

Description

A real-time large-scale network data crawling system

Technical field

The present invention relates to the technical field of big data and cloud computing, and in particular to a real-time large-scale network data crawling system.

Background technology

With the rapid development and growing popularity of the Internet, the content offered by online information platforms has become ever richer. Users searching for the information they need face increasing search difficulty, and the time and effort required to filter information has grown accordingly. The advent of search engines addressed the problem of retrieving information at massive scale. Search engines collect resources by means of crawlers. A web crawler crawls and collects web documents over network connections: starting from previously given URLs, it uses the HTTP protocol to fetch the required HTML documents, analyzes the hyperlinks contained in those documents, and then fetches the links that have not yet been visited together with the resources they contain. This is repeated until no new URLs remain.
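The generic crawl loop described above can be sketched as follows. This is only an illustration of a conventional crawler, not of the claimed system: the in-memory `PAGES` dictionary stands in for HTTP fetches, and the names `PAGES`, `LinkExtractor`, and `crawl` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML body (stands in for HTTP fetches).
PAGES = {
    "http://example.com/a": '<a href="http://example.com/b">b</a>'
                            '<a href="http://example.com/c">c</a>',
    "http://example.com/b": '<a href="http://example.com/a">a</a>',
    "http://example.com/c": "<p>no links</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch=PAGES.get):
    """Visit pages breadth-first from the seed URLs until no new URL remains."""
    frontier = deque(seeds)
    seen = set(seeds)
    visited = []
    while frontier:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:          # unknown or unreachable page
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # only queue links not yet encountered
                seen.add(link)
                frontier.append(link)
    return visited
```

Calling `crawl(["http://example.com/a"])` visits the three pages once each; replacing `fetch` with a function that performs real HTTP requests would be the obvious next step.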

However, owing to the rapid development of the mobile Internet, new web content is growing explosively, and traditional crawling systems cannot meet the demands of crawling network data at large scale.

Summary of the invention

It is therefore necessary to provide a real-time large-scale network data crawling system capable of crawling large-scale network data in real time.

A real-time large-scale network data crawling system comprises the following modules:

An initial seed optimization module, for entering the seed links of websites; feeding the best results back to the user by means of meta-search; mining the links that rank highest in topic relevance; and periodically adding qualifying web page links to the seed set, which serves as the initial seed set;

An integration module, for obtaining HTML web documents and annotating the information in their text;

A web page hyperlink information integration module, for storing the descriptive data of the hyperlinks of web pages: given two pages A and B, if a hyperlink in A points to B, the information content of B is assumed by default to be of higher quality than that of A; and if a user's query points to the hyperlinks of both A and B at the same time, the information quality of A and B is assumed by default to be the same;

A web page relevance calculation module, for analyzing topic relevance with concrete numerical values, so that the degree of relevance is expressed as a number;

A hyperlink importance calculation module, which uses the calculated numerical information as one basis for judging relevance, likewise by quantitative analysis with concrete values: if the number of links contained in the current page reaches a certain value, the page is taken to contain a number of links; if that number reaches a preset value, the topic resources the page contains are taken to meet the preset requirement.

In the real-time large-scale network data crawling system of the present invention, the web page relevance calculation module comprises:

A single traversal unit, for: taking as input the character string of the web page body, defined as m-scontent; searching in a loop, the loop condition being that the corresponding marker is found, the markers being defined as delimiters; locating cut position 1 with a search function, defined as Find(); locating cut position 2 with the same search function; cutting between positions 1 and 2 and outputting the resulting string, defined as dest; and, when the traversal ends, outputting the string;

A repetition unit, for repeating the single traversal unit until the information mining points have been mined, and extracting, by a plain-text classification and aggregation algorithm, the feature-vector keywords of the mined points; the topic relevance of the mined information is represented with a vector space model.
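The single traversal and its repetition can be sketched as below. This is a minimal sketch under stated assumptions: Python's built-in `str.find` stands in for the search function Find(), the two delimiters are passed as separate start and end markers, and the function names `extract_between` and `extract_all` are hypothetical.

```python
def extract_between(content, start_delim, end_delim, begin=0):
    """Single traversal: return (dest, next_pos), or (None, -1) if no span is found."""
    pos1 = content.find(start_delim, begin)   # cut position 1
    if pos1 == -1:
        return None, -1
    pos1 += len(start_delim)                  # skip past the opening marker
    pos2 = content.find(end_delim, pos1)      # cut position 2, same search function
    if pos2 == -1:
        return None, -1
    return content[pos1:pos2], pos2 + len(end_delim)


def extract_all(content, start_delim, end_delim):
    """Repetition: run the single traversal until no further span is found."""
    results, pos = [], 0
    while True:
        dest, pos = extract_between(content, start_delim, end_delim, pos)
        if dest is None:
            return results
        results.append(dest)
```

For instance, `extract_all(page, "<p>", "</p>")` collects the text of every paragraph element of `page`; the extracted strings would then feed the keyword extraction step.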

In the real-time large-scale network data crawling system of the present invention,

the vector space model is expressed as follows:

First, the text of the web page is analyzed, and the vector $\alpha = (w_1, w_2, \ldots, w_n)$, $i = 1, 2, \ldots, n$, is defined.

The number of occurrences of each keyword is counted, the keywords with the highest frequency of occurrence serving as the selection criterion. The frequency is defined as $x_i$, the components $x_i w_i$ are formed, and the page topic vector is defined as $\beta = (x_1 w_1, x_2 w_2, \ldots, x_n w_n)$, $i = 1, 2, \ldots, n$. The cosine of the angle between the two vectors then reflects the frequency with which the keywords occur, and the relevance is given by:

$$\cos\langle\alpha,\beta\rangle = \frac{\alpha\cdot\beta}{\lVert\alpha\rVert\,\lVert\beta\rVert} = \frac{\sum_{i=1}^{n} x_i w_i^{2}}{\sqrt{\sum_{i=1}^{n} w_i^{2}}\,\sqrt{\sum_{i=1}^{n} (x_i w_i)^{2}}}$$

The larger the angle between the two vectors, the lower the frequency and the lower the relevance to the topic; the smaller the angle, the higher the frequency of occurrence and the higher the relevance to the topic;

A threshold is set for the relevance between the current web page and the topic. A relevance above the threshold indicates that the page is relevant to the topic; otherwise the page is not relevant. Pages relevant to the topic are classified and saved, and submitted to the database to build index data.
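The relevance test described above can be sketched as follows. The sketch assumes that $x_i$ is the raw occurrence count of keyword $i$ in the page (the cosine is unchanged if counts are rescaled to relative frequencies) and that the keyword weights $w_i$ are supplied by the caller; the function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_relevance(page_words, keyword_weights):
    """Relevance of a page to a topic, given a mapping {keyword: weight w_i}."""
    counts = Counter(page_words)                               # x_i (assumed raw counts)
    keywords = list(keyword_weights)
    alpha = [keyword_weights[k] for k in keywords]             # (w_1, ..., w_n)
    beta = [counts[k] * keyword_weights[k] for k in keywords]  # (x_1 w_1, ..., x_n w_n)
    return cosine(alpha, beta)


def is_relevant(page_words, keyword_weights, threshold):
    """A page counts as relevant only when its relevance exceeds the threshold."""
    return topic_relevance(page_words, keyword_weights) > threshold
```

A page that mentions every keyword equally often scores close to 1, while a page that mentions none of them scores 0 and is discarded for any positive threshold.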

In the real-time large-scale network data crawling system of the present invention, setting the threshold for the relevance between the current web page and the topic comprises:

Periodically taking a random sample to obtain a preset number of original web page documents, manually analyzing the relevance of those pages, and calculating the accuracy;

Computing the accuracy statistics repeatedly: if the fluctuation of the accuracy over a predetermined number of measurements is less than a preset error value, lowering the threshold to improve crawler coverage; if the fluctuation of the accuracy over the predetermined number of measurements is greater than or equal to the preset error value, raising the threshold to improve crawler accuracy;

Repeating the accuracy statistics until the desired threshold is obtained.
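One adjustment step of this procedure might look like the sketch below. The source does not define how fluctuation is measured or by how much the threshold moves, so the sketch assumes fluctuation is the spread (maximum minus minimum) of the sampled accuracies and uses a fixed, hypothetical `step`; the threshold is clamped to the range of a cosine-based relevance score.

```python
def adjust_threshold(threshold, accuracies, max_error, step=0.05):
    """Return the new threshold given the accuracies of the last few sampling rounds."""
    fluctuation = max(accuracies) - min(accuracies)  # assumed definition of fluctuation
    if fluctuation < max_error:
        new_threshold = threshold - step   # accuracy is stable: widen coverage
    else:
        new_threshold = threshold + step   # accuracy is unstable: favor precision
    return min(1.0, max(0.0, new_threshold))
```

In use, the crawler would call `adjust_threshold` after each batch of manually checked samples and stop once the threshold settles at an acceptable value.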

In the real-time large-scale network data crawling system of the present invention, the hyperlink importance calculation module comprises:

The page importance is calculated as follows:

$p_u = w_1 \cos\langle\alpha,\beta\rangle + w_2\,\mathrm{Hub}(u)$, where $\mathrm{Hub}(u)$ denotes the link importance of the web page, $CL(u)$ denotes the number of links found, and $C_{max}$ denotes its maximum. The weight of page relevance ($w_1$ in the formula) is denoted m1 and the weight of page link degree ($w_2$ in the formula) is denoted m2; m1 and m2 satisfy 0 < m1, m2 < 1 and m1 + m2 = 1.
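The weighted combination can be illustrated with the sketch below. The expression for Hub(u) is not reproduced in this text, so the sketch assumes, purely for illustration, that Hub(u) is the link count CL(u) normalized by its maximum C_max; the function names and the default weight are hypothetical.

```python
def hub(link_count, max_link_count):
    """Assumed link importance: CL(u) / C_max (an assumption, not taken from the source)."""
    if max_link_count <= 0:
        return 0.0
    return link_count / max_link_count


def page_importance(relevance, link_count, max_link_count, m1=0.5):
    """p_u = m1 * relevance + m2 * Hub(u), with m2 = 1 - m1 and 0 < m1 < 1."""
    if not 0.0 < m1 < 1.0:
        raise ValueError("m1 must lie strictly between 0 and 1")
    m2 = 1.0 - m1                       # enforces m1 + m2 = 1
    return m1 * relevance + m2 * hub(link_count, max_link_count)
```

With equal weights, a page with relevance 0.8 and half the maximum link count scores 0.65; shifting `m1` toward 1 makes topic relevance dominate the ranking.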

In the real-time large-scale network data crawling system of the present invention, the integration module further comprises:

Obtaining the HTML web document and analyzing whether it is a video HTML page. If it is, determining whether the URL belongs to a crawlable domain: if it does not, the process ends directly; if it does, the domain name of the URL is obtained, together with the video parsing class corresponding to that domain name. Determining whether the video parsing class is empty: if it is empty, the process ends; if it is not empty, determining whether the URL is the playback address of a video HTML page. If it is not a playback address, the process ends; if it is, the list of real video download addresses is obtained from the URL and the content, and when that list is not empty, it is returned and the process ends. When the document is not a video HTML page, the information in the text is annotated and handed to the web page hyperlink information integration module.
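The decision flow above can be sketched as a chain of early exits. Everything named here is hypothetical: the `is_video_html` predicate, the `allowed_domains` set, the `parsers` registry, and the `DemoParser` class with its `is_play_url` and `download_urls` methods merely stand in for the video parsing classes the text mentions.

```python
from urllib.parse import urlparse


class DemoParser:
    """Hypothetical per-domain video parsing class."""

    def is_play_url(self, url):
        return "/play/" in url

    def download_urls(self, url, html):
        return [url.replace("/play/", "/file/") + ".mp4"]


def resolve_video(url, html, is_video_html, allowed_domains, parsers):
    """Return the list of real download addresses, or None when the flow ends early."""
    if not is_video_html(html):
        return None              # non-video pages go to text annotation instead
    domain = urlparse(url).netloc
    if domain not in allowed_domains:
        return None              # URL is outside the crawlable domains
    parser = parsers.get(domain)
    if parser is None:
        return None              # no parsing class registered for this domain
    if not parser.is_play_url(url):
        return None              # not a playback address
    downloads = parser.download_urls(url, html)
    return downloads or None     # an empty list also ends the flow
```

Keeping one parsing class per domain means support for a new video site is added by registering another entry in `parsers`, without touching the flow itself.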

Compared with the prior art, the real-time large-scale network data crawling system provided by the present invention has the following beneficial effects. The web page relevance calculation module analyzes topic relevance with concrete numerical values, so that the degree of relevance is expressed as a number. The hyperlink importance calculation module uses the calculated numerical information as one basis for judging relevance, likewise by quantitative analysis with concrete values: if the number of links contained in the current page reaches a certain value, the page is taken to contain a number of links, and if that number reaches a preset value, the topic resources the page contains are taken to meet the preset requirement. The desired network data can thus be obtained from massive volumes of data. In addition, because the integration module obtains the HTML web document and analyzes whether it is a video HTML page, ordinary web pages can be distinguished from video pages, which makes crawling more efficient.

Brief description of the drawings

Fig. 1 is a structural block diagram of the real-time large-scale network data crawling system according to an embodiment of the present invention.

Fig. 2 is a structural block diagram of the web page relevance calculation module in Fig. 1.

Detailed description of the invention

As shown in Figs. 1 and 2, a real-time large-scale network data crawling system comprises the following modules:

An initial seed optimization module, for entering the seed links of websites; feeding the best results back to the user by means of meta-search; mining the links that rank highest in topic relevance; and periodically adding qualifying web page links to the seed set, which serves as the initial seed set.

Optionally, a max-priority queue is provided in the initial seed optimization module. The max-priority queue maintains a set, each element of which has a corresponding priority key. The max-priority queue supports the following operations:

Insert(set, e, key): insert the element e with priority key into set;

Max(set): return the element with the highest priority in set;

Ext(set): return the element with the highest priority in set and delete it from set;

Increase operation (set, e, key): set the priority of element e in set to key.

In this embodiment, the queue can be implemented with a heap, which is highly efficient.
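A heap-backed queue with the four operations listed above might be sketched as follows. The sketch relies on Python's `heapq`, which is a min-heap, so keys are negated; priority changes use lazy deletion, in which the old entry is marked stale rather than removed in place. The class and method names are hypothetical, and elements are assumed to be hashable and not `None`.

```python
import heapq
import itertools


class MaxPriorityQueue:
    """Max-priority queue on top of heapq (a min-heap), using negated keys."""

    def __init__(self):
        self._heap = []                   # entries: [-key, order, element]
        self._entries = {}                # element -> its live heap entry
        self._order = itertools.count()   # tie-breaker so elements are never compared

    def insert(self, e, key):
        """Insert e with priority key; an existing entry for e is replaced."""
        if e in self._entries:
            self._entries[e][2] = None    # mark the old entry as stale
        entry = [-key, next(self._order), e]
        self._entries[e] = entry
        heapq.heappush(self._heap, entry)

    set_priority = insert                 # changing a priority is a re-insert

    def _drop_stale(self):
        while self._heap and self._heap[0][2] is None:
            heapq.heappop(self._heap)

    def max(self):
        """Return the highest-priority element without removing it."""
        self._drop_stale()
        if not self._heap:
            raise IndexError("queue is empty")
        return self._heap[0][2]

    def extract_max(self):
        """Return the highest-priority element and delete it from the queue."""
        self._drop_stale()
        if not self._heap:
            raise IndexError("queue is empty")
        _, _, e = heapq.heappop(self._heap)
        del self._entries[e]
        return e
```

Insertion and extraction each take logarithmic time in the queue size, which is what makes a heap a natural choice for a seed frontier ordered by priority.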

An integration module, for obtaining HTML web documents and annotating the information in their text.

A web page hyperlink information integration module, for storing the descriptive data of the hyperlinks of web pages: given two pages A and B, if a hyperlink in A points to B, the information content of B is assumed by default to be of higher quality than that of A; and if a user's query points to the hyperlinks of both A and B at the same time, the information quality of A and B is assumed by default to be the same.

A web page relevance calculation module, for analyzing topic relevance with concrete numerical values, so that the degree of relevance is expressed as a number.

A single traversal unit, for: taking as input the character string of the web page body, defined as m-scontent; searching in a loop, the loop condition being that the corresponding marker is found, the markers being defined as delimiters; locating cut position 1 with a search function, defined as Find(); locating cut position 2 with the same search function; cutting between positions 1 and 2 and outputting the resulting string, defined as dest; and, when the traversal ends, outputting the string.

The vector space model is expressed as follows:

The larger the angle between the two vectors, the lower the frequency and the lower the relevance to the topic; the smaller the angle, the higher the frequency of occurrence and the higher the relevance to the topic.

A random sample is taken periodically to obtain a preset number of original web page documents, the relevance of those pages is analyzed manually, and the accuracy is calculated.

The accuracy statistics are computed repeatedly: if the fluctuation of the accuracy over a predetermined number of measurements is less than a preset error value, the threshold is lowered to improve crawler coverage; if the fluctuation of the accuracy over the predetermined number of measurements is greater than or equal to the preset error value, the threshold is raised to improve crawler accuracy.

The page importance is calculated as follows:

It will be understood that a person of ordinary skill in the art may make various other corresponding changes and modifications in accordance with the technical concept of the present invention, and all such changes and modifications shall fall within the scope of protection of the claims of the present invention.

Claims

1. A real-time large-scale network data crawling system, characterized in that it comprises the following modules:

An initial seed optimization module, for entering the seed links of websites; feeding the best results back to the user by means of meta-search; mining the links that rank highest in topic relevance; and periodically adding qualifying web page links to the seed set, which serves as the initial seed set;

A web page hyperlink information integration module, for storing the descriptive data of the hyperlinks of web pages: given two pages A and B, if a hyperlink in A points to B, the information content of B is assumed by default to be of higher quality than that of A; and if a user's query points to the hyperlinks of both A and B at the same time, the information quality of A and B is assumed by default to be the same;

A web page relevance calculation module, for analyzing topic relevance with concrete numerical values, so that the degree of relevance is expressed as a number;

A hyperlink importance calculation module, which uses the calculated numerical information as a basis for judging relevance, likewise by quantitative analysis with concrete values: if the number of links contained in the current page reaches a certain value, the page is taken to contain a number of links; if that number reaches a preset value, the topic resources the page contains are taken to meet the preset requirement.

2. The real-time large-scale network data crawling system according to claim 1, characterized in that the web page relevance calculation module comprises:

A single traversal unit, for: taking as input the character string of the web page body, defined as m-scontent; searching in a loop, the loop condition being that the corresponding marker is found, the markers being defined as delimiters; locating cut position 1 with a search function, defined as Find(); locating cut position 2 with the same search function; cutting between positions 1 and 2 and outputting the resulting string, defined as dest; and, when the traversal ends, outputting the string;

A repetition unit, for repeating the single traversal unit until the information mining points have been mined, and extracting, by a plain-text classification and aggregation algorithm, the feature-vector keywords of the mined points; the topic relevance of the mined information is represented with a vector space model.

3. The real-time large-scale network data crawling system according to claim 2, characterized in that

the vector space model is expressed as follows:

First, the text of the web page is analyzed, and the vector $\alpha = (w_1, w_2, \ldots, w_n)$, $i = 1, 2, \ldots, n$, is defined. The number of occurrences of each keyword is counted, the keywords with the highest frequency of occurrence serving as the selection criterion. The frequency is defined as $x_i$, the components $x_i w_i$ are formed, and the page topic vector is defined as $\beta = (x_1 w_1, x_2 w_2, \ldots, x_n w_n)$, $i = 1, 2, \ldots, n$. The cosine of the angle between the two vectors then reflects the frequency with which the keywords occur, and the relevance is given by:

$$\cos\langle\alpha,\beta\rangle = \frac{\alpha\cdot\beta}{\lVert\alpha\rVert\,\lVert\beta\rVert} = \frac{\sum_{i=1}^{n} x_i w_i^{2}}{\sqrt{\sum_{i=1}^{n} w_i^{2}}\,\sqrt{\sum_{i=1}^{n} (x_i w_i)^{2}}}$$

The larger the angle between the two vectors, the lower the frequency and the lower the relevance to the topic; the smaller the angle, the higher the frequency of occurrence and the higher the relevance to the topic;

A threshold is set for the relevance between the current web page and the topic. A relevance above the threshold indicates that the page is relevant to the topic; otherwise the page is not relevant. Pages relevant to the topic are classified and saved, and submitted to the database to build index data.

4. The real-time large-scale network data crawling system according to claim 3, characterized in that setting the threshold for the relevance between the current web page and the topic comprises:

Periodically taking a random sample to obtain a preset number of original web page documents, manually analyzing the relevance of those pages, and calculating the accuracy;

Computing the accuracy statistics repeatedly: if the fluctuation of the accuracy over a predetermined number of measurements is less than a preset error value, lowering the threshold to improve crawler coverage; if the fluctuation of the accuracy over the predetermined number of measurements is greater than or equal to the preset error value, raising the threshold to improve crawler accuracy;

5. The real-time large-scale network data crawling system according to claim 4, characterized in that the hyperlink importance calculation module comprises:

The page importance is calculated as follows:

$p_u = w_1 \cos\langle\alpha,\beta\rangle + w_2\,\mathrm{Hub}(u)$, where $\mathrm{Hub}(u)$ denotes the link importance of the web page, $CL(u)$ denotes the number of links found, and $C_{max}$ denotes its maximum. The weight of page relevance ($w_1$ in the formula) is denoted m1 and the weight of page link degree ($w_2$ in the formula) is denoted m2; m1 and m2 satisfy 0 < m1, m2 < 1 and m1 + m2 = 1.

6. The real-time large-scale network data crawling system according to claim 5, wherein the integration module further comprises:

Obtaining the HTML web document and analyzing whether it is a video HTML page. If it is, determining whether the URL belongs to a crawlable domain: if it does not, the process ends directly; if it does, the domain name of the URL is obtained, together with the video parsing class corresponding to that domain name. Determining whether the video parsing class is empty: if it is empty, the process ends; if it is not empty, determining whether the URL is the playback address of a video HTML page. If it is not a playback address, the process ends; if it is, the list of real video download addresses is obtained from the URL and the content, and when that list is not empty, it is returned and the process ends. When the document is not a video HTML page, the information in the text is annotated and handed to the web page hyperlink information integration module.