CN105956069A

CN105956069A - Network information collection and analysis method and network information collection and analysis system

Info

Publication number: CN105956069A
Application number: CN201610277727.3A
Authority: CN
Inventors: 吴斌; 谢晓勇; 黄�俊; 胡春华; 陈志雄; 胡浩
Original assignee: Up Wealth Management Co ltd
Current assignee: Up Wealth Management Co ltd
Priority date: 2016-04-28
Filing date: 2016-04-28
Publication date: 2016-09-21

Abstract

The invention provides a network information collection and analysis method and system. The method comprises the following steps: S1: assigning crawl tasks to preset crawl nodes, wherein each crawl task corresponds to at least the network address of one target network; S2: receiving the crawled data information sent by the crawl nodes, extracting target data from the data information, and storing the target data in a target database, wherein the target data comprises the title, source, release time, and body content of the data information; S3: determining the degree of repetition between the current target data and the other target data in the target database. The invention collects information by intelligent technical means, which largely removes the bottleneck of manual information collection and processing; in addition, a program automatically builds an index of the key information in each item, laying a solid foundation for big-data applications of the information.

Description

A network information collection and analysis method and system

Technical field

The present invention relates to the technical field of Internet data crawlers, and in particular to an Internet-based method and system for collecting and analyzing network information.

Background technology

The way the public obtains financial and economic information is gradually shifting from newspapers, radio, and television to the Internet as the main channel. As network information spreads over the Internet, information from different sources gains lasting influence and reach.

At present, most network information is maintained and managed manually, which has many shortcomings in terms of the timeliness and utilization of the information. Collecting and organizing financial and economic information by hand therefore consumes a great deal of time and effort, and the results fall short of expectations.

Given these difficulties in the new Internet environment, Internet-based technical means are needed to quickly detect when and how widely network information spreads, and to improve the ability to guide the updating of information on the Internet and to mine its content.

Summary of the invention

The problem to be solved by the present invention is to provide a network information collection and analysis method and system that can supply mathematical and logical support for the in-depth mining and application of network information.

To solve the above technical problem, the present invention provides the following technical solution:

A network information collection and analysis method, comprising the following steps:

S1: assigning crawl tasks to preset crawl nodes, each crawl task corresponding to at least the network address of one target network;

S2: receiving the crawled data information sent by the crawl nodes, extracting target data from the data information, and storing the target data in a target database, the target data including the title, source, release time, and body content of the data information;

S3: determining the degree of repetition between the current target data and the other target data in the target database.
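Step S2 can be illustrated with a minimal sketch. The patent does not specify a record format or a storage engine, so the raw-record keys, the `TargetData` class, the table schema, and the use of SQLite below are all assumptions made for illustration only.

```python
import sqlite3
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class TargetData:
    """The four fields that step S2 names for each piece of target data."""
    title: str
    source: str
    release_time: str  # ISO 8601 string; the format is an assumption
    body: str


def extract_target_data(raw: dict) -> TargetData:
    """Pick the four target fields out of a raw crawl record (hypothetical keys)."""
    return TargetData(
        title=raw["title"].strip(),
        source=raw["source"].strip(),
        release_time=raw["release_time"],
        body=raw["body"].strip(),
    )


def open_target_database(path: str = ":memory:") -> sqlite3.Connection:
    """Create the target database with a single table (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS target_data ("
        "id INTEGER PRIMARY KEY, title TEXT, source TEXT, "
        "release_time TEXT, body TEXT)"
    )
    return conn


def store_target_data(conn: sqlite3.Connection, item: TargetData) -> int:
    """Insert one record into the target database and return its row id."""
    cur = conn.execute(
        "INSERT INTO target_data (title, source, release_time, body) "
        "VALUES (?, ?, ?, ?)",
        astuple(item),
    )
    conn.commit()
    return cur.lastrowid
```

Any extra keys in the raw record, such as the page URL, are simply ignored by `extract_target_data`, which keeps the stored record limited to the four fields the method names.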

Preferably, step S1 is further configured to assign the crawl tasks according to the state of the web crawlers distributed on the crawl nodes.

Preferably, step S1 further includes:

S10: determining the number of seeds of the target websites corresponding to each crawl node;

S11: determining, for the web crawler of each crawl node, the number of seeds whose crawling has been completed and the number of seeds whose crawling has not been completed;

S12: sorting the web crawlers of the crawl nodes by their number of completed seeds, from high to low;

S13: assigning the seeds whose crawl tasks have not been completed to the crawl nodes according to the order obtained in S12.
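Steps S10 to S13 can be sketched as follows. The patent states only that unfinished seeds are assigned "according to the order in S12"; dealing them out round-robin in ranked order is one possible reading and is an assumption here, as are the `CrawlNode` class and the function name.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlNode:
    name: str
    seeds: list                                   # S10: all seeds given to this node
    completed: set = field(default_factory=set)   # seeds already crawled

    def unfinished(self) -> list:
        """Seeds of this node that have not been crawled yet."""
        return [s for s in self.seeds if s not in self.completed]


def reassign_unfinished(nodes: list) -> dict:
    """Return {node name: seeds to crawl next} following steps S11 to S13."""
    # S11: collect every seed that is still waiting to be crawled.
    pending = [seed for node in nodes for seed in node.unfinished()]
    # S12: rank nodes by completed seed count, highest first.
    ranked = sorted(nodes, key=lambda n: len(n.completed), reverse=True)
    # S13: deal pending seeds out in ranked order (round-robin is an assumption).
    plan = {node.name: [] for node in ranked}
    for i, seed in enumerate(pending):
        plan[ranked[i % len(ranked)].name].append(seed)
    return plan
```

Under this reading, the node that has completed the most seeds receives the first pending seed, so faster crawlers pick up work left behind by slower ones.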

Preferably, step S3 further includes:

S30: building an index for the target data and storing the index in an index database;

S31: comparing the target data corresponding to the indexes in the index database, determining the degree of repetition of each piece of target data, and writing the degree of repetition into the target database in association with the corresponding target data.
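Steps S30 and S31 can be sketched with an in-memory inverted index. The patent does not name a similarity measure; the Jaccard coefficient over keyword sets used below is a swapped-in, commonly used choice, and the `IndexDatabase` class and the simple tokenizer are hypothetical (Chinese text would need a word segmenter instead).

```python
import re
from collections import defaultdict


def keywords(text: str) -> set:
    """Very simple keyword extraction: lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class IndexDatabase:
    """In-memory stand-in for the index database (hypothetical design)."""

    def __init__(self):
        self.postings = defaultdict(set)  # keyword -> ids of records containing it
        self.terms = {}                   # record id -> keyword set

    def add(self, record_id, text: str) -> None:
        """S30: index one piece of target data by its keywords."""
        terms = keywords(text)
        self.terms[record_id] = terms
        for term in terms:
            self.postings[term].add(record_id)

    def repetition(self, record_id) -> dict:
        """S31: return {other id: Jaccard similarity} for records sharing a keyword."""
        mine = self.terms[record_id]
        candidates = set()
        for term in mine:
            candidates |= self.postings[term]
        candidates.discard(record_id)
        return {
            other: len(mine & self.terms[other]) / len(mine | self.terms[other])
            for other in candidates
        }
```

Only records that share at least one keyword are compared, which is the practical benefit of going through the index rather than comparing every pair. The resulting scores would then be written back to the target database alongside each record.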

Preferably, according to the degree of repetition of the target data, a correspondence is established between each piece of target data that has duplicates and time.

Preferably, the index includes the keywords and key terms in the target data.

Preferably, the target database is associated with the index information contained in the index database.

Preferably, the target network is a network related to financial and economic information.

The present invention also provides a network information collection and analysis system that applies the network information collection and analysis method described above, the system including:

a task allocation module, which assigns crawl tasks to preset crawl nodes, each crawl task corresponding to at least the network address of one target network;

a crawl module, which receives and executes the crawl tasks;

an extraction module, which receives the data information crawled by each crawl node in the crawl module, extracts target data from the data information, and stores the target data in a target database, the target data including the title, source, release time, and body content of the data information;

an analysis module, which determines, from the data information extracted by the extraction module, the degree of repetition between the current target data and the other target data in the target database.

Preferably, the analysis module further includes:

an index establishing unit, which builds an index for the target data and stores the index in an index database;

a repetition judging unit, which, based on the target data corresponding to the established indexes, determines the degree of repetition of each piece of target data and writes the degree of repetition into the target database in association with the corresponding target data.

The beneficial effects of the present invention are as follows: the invention collects information by intelligent technical means, which largely removes the bottleneck of manual information collection and processing, and a program automatically builds an index of the key information in each item, laying a solid foundation for big-data applications of the information.

Brief description of the drawings

Fig. 1 is a flowchart of the network information collection and analysis method in an embodiment of the present invention;

Fig. 2 is a block diagram of the network information collection and analysis system in an embodiment of the present invention.

Description of reference numerals

1 - task allocation module    2 - crawl module

3 - extraction module    4 - analysis module

Detailed description of the invention

Embodiments of the present invention are described in more detail below with reference to the accompanying drawings, but they are not intended to limit the present invention.

The present invention provides a network information collection and analysis method and system. The method can automatically analyze the data crawled by the crawl nodes in a network, build the relevant indexes, and analyze the relationship between the degree of repetition of the data corresponding to each index and time, which provides strong back-end support for data mining.

Fig. 1 shows a flowchart of a network information collection and analysis method in an embodiment of the present invention, which includes the following steps:

S1: by means of a tool orchestration platform, assigning crawl tasks to preset crawl nodes, each crawl task corresponding to at least the network address of one target network; the network address may be the address of a website about financial information.

S2: receiving the crawled data information sent by each crawl node, extracting target data from the data information, and storing the target data in a target database, the target data including the title, source, release time, and body content of the data information; the data information may also include the seeds of the websites already crawled and the seed information of websites whose crawl tasks have not been completed.

S3: determining the degree of repetition between the current target data and the other target data in the target database. The degree of repetition may cover repetition of the title, repetition of the web page content, or repetition of the source, and the repetition in these several respects may be combined in a single calculation to obtain a degree of repetition that reflects all of the repeated content.
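The combined calculation mentioned in S3 could look like the sketch below. The patent does not give a formula, so the weighted average, the weight values, the exact-match test for the source, and the use of `difflib.SequenceMatcher` for text similarity are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical weights for the three kinds of repetition named in S3.
WEIGHTS = {"title": 0.3, "body": 0.5, "source": 0.2}


def text_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


def combined_repetition(current: dict, other: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of title, body, and source repetition for two records."""
    parts = {
        "title": text_similarity(current["title"], other["title"]),
        "body": text_similarity(current["body"], other["body"]),
        # Source repetition is treated as an exact match (assumption).
        "source": 1.0 if current["source"] == other["source"] else 0.0,
    }
    total = sum(weights.values())
    return sum(weights[name] * parts[name] for name in parts) / total
```

A reprint of the same article on a different site scores high on title and body but zero on source, so its combined score stays below that of an exact duplicate from the same site.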

With the above configuration, embodiments of the present invention can compile statistics on the correlation of the information published on the related websites and summarize the repeated content, in order to analyze hot topics and the popularity of the content.

In addition, step S1 may be further configured to assign the crawl tasks according to the state of the web crawlers distributed on the crawl nodes. That is, crawl tasks can be assigned according to the task completion state or idle state of the web crawler on each network node, so as to balance the workload of the network nodes. Specifically, step S1 in the embodiment of the present invention may further include:

S10: determining the number of seeds of the target websites corresponding to each crawl node, that is, determining the total task volume of the web crawler of each crawl node;

With the above configuration, the crawl task completion of each web crawler on the network nodes can be calculated automatically, and tasks can be reassigned according to the ranking of that completion. This improves crawl efficiency and the cooperation among the network nodes, so that crawl tasks are completed quickly and effectively.

In addition, step S3 in this embodiment may further include:

That is, by building indexes, repeated content or the degree of repetition of key content can be found quickly and more efficiently, and the data information can be retrieved and read more conveniently.

In this embodiment, according to the degree of repetition of the target data, a correspondence is established between each piece of target data that has duplicates and time. That is, a timeline relationship can be established between each piece of data information or target data that has duplicate information and its release time, and this relationship can be stored in the target database. The index in this embodiment may include the keywords and key terms in the target data, and the target database is associated with the index information contained in the index database. Through the association between the target database and the index database, the relevant data information can be located quickly, so that information can be read, looked up, and compared quickly.
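The timeline relationship described above can be sketched as follows. The input shapes (a mapping from record id to an ISO 8601 release time, and a list of record pairs already judged to be duplicates) and the function name are assumptions; grouping linked records with a union-find structure is one straightforward way to do it, not something the patent prescribes.

```python
from collections import defaultdict
from datetime import datetime


def duplicate_timelines(release_times: dict, duplicate_pairs: list) -> list:
    """Group records linked by duplicate pairs and order each group by release time."""
    parent = {rid: rid for rid in release_times}

    def find(x):
        # Follow parent links to the group representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the groups of every pair that was judged to be a duplicate.
    for a, b in duplicate_pairs:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for rid in release_times:
        groups[find(rid)].append(rid)

    def when(rid):
        return datetime.fromisoformat(release_times[rid])

    # Keep only groups that actually contain duplicates, earliest release first.
    timelines = [sorted(m, key=when) for m in groups.values() if len(m) > 1]
    return sorted(timelines, key=lambda t: when(t[0]))
```

The first id in each timeline is the earliest publication of that content, and the later ids show how it spread over time.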

The present invention also provides a network information collection and analysis system that applies the network information collection and analysis method described in the above embodiment. As shown in Fig. 2, the network information collection and analysis system in the embodiment of the present invention may include a task allocation module 1, a crawl module 2, an extraction module 3, and an analysis module 4. The task allocation module 1 can assign crawl tasks to preset crawl nodes, each crawl task corresponding to at least the network address of one target network. The crawl module 2 can receive and execute the crawl tasks, and includes the web crawlers deployed on the network nodes. The extraction module 3 can receive the data information crawled by each crawl node in the crawl module 2, extract target data from the data information, and store the target data in a target database, the target data including the title, source, release time, and body content of the data information. In addition, the analysis module 4 can determine, from the data information extracted by the extraction module 3, the degree of repetition between the current target data and the other target data in the target database.

Based on the above configuration, the system of this embodiment can compile statistics on the correlation of the information published on the related websites and summarize the repeated content, in order to analyze hot topics and the popularity of the content.

In addition, this embodiment may also include a calculation module and a sorting module. The calculation module is used to calculate the number of seeds of the target websites corresponding to each crawl node, that is, to determine the total task volume of the web crawler of each crawl node. The sorting module is used to determine the number of seeds whose crawling the web crawler of each crawl node has completed and the number of seeds not yet crawled. The task allocation module then assigns the seeds whose crawl tasks have not been completed to the crawl nodes according to the order produced by the sorting module.

In addition, the analysis module 4 in this embodiment may further include an index establishing unit 41 and a repetition judging unit 42. The index establishing unit 41 can build an index for the target data and store the index in an index database;

The repetition judging unit 42 can, based on the target data corresponding to the established indexes, determine the degree of repetition of each piece of target data and write the degree of repetition into the target database in association with the corresponding target data. That is, by building indexes, repeated content or the degree of repetition of key content can be found quickly and more efficiently, and the data information can be retrieved and read more conveniently.

The above embodiments are merely exemplary embodiments of the present invention and are not intended to limit it; the scope of protection of the present invention is defined by the claims. Those skilled in the art may make various modifications or equivalent substitutions to the present invention within its spirit and scope of protection, and such modifications or equivalent substitutions shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A network information collection and analysis method, characterized by comprising the following steps:

2. The method according to claim 1, characterized in that step S1 is further configured to assign the crawl tasks according to the state of the web crawlers distributed on the crawl nodes.

3. The method according to claim 2, characterized in that step S1 further includes:

4. The method according to claim 1, characterized in that step S3 further includes:

5. The method according to claim 4, characterized in that, according to the degree of repetition of the target data, a correspondence is established between each piece of target data that has duplicates and time.

6. The method according to claim 4, characterized in that the index includes the keywords and key terms in the target data.

7. The method according to claim 4, characterized in that the target database is associated with the index information contained in the index database.

8. The method according to claim 1, characterized in that the target network is a network related to financial and economic information.

9. A network information collection and analysis system, which applies the network information collection and analysis method according to any one of claims 1-8, the system including:

a crawl module, which receives and executes the crawl tasks;

10. The system according to claim 9, characterized in that the analysis module further includes: