CN105956069A - Network information collection and analysis method and network information collection and analysis system - Google Patents

Network information collection and analysis method and network information collection and analysis system

Info

Publication number
CN105956069A
Authority
CN
China
Prior art keywords
data
crawl
index
target data
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610277727.3A
Other languages
Chinese (zh)
Inventor
吴斌
谢晓勇
黄�俊
胡春华
陈志雄
胡浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Up Wealth Management Co ltd
Original Assignee
Up Wealth Management Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-04-28
Filing date
2016-04-28
Publication date
2016-09-21
Application filed by Up Wealth Management Co ltd
Priority to CN201610277727.3A
Publication of CN105956069A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a system for collecting and analyzing network information. The method comprises the following steps: S1: assigning crawl tasks to preset crawl nodes, wherein each crawl task corresponds to the network address of at least one target network; S2: receiving the crawled data information sent by the crawl nodes, extracting target data from the data information, and storing the target data in a target database, wherein the target data comprises the title, source, release time, and body content of the data information; S3: determining the degree of repetition between the current target data and the other target data in the target database. The invention collects information by automated, intelligent technical means, which largely overcomes the bottleneck of manual information collection and processing; in addition, programs automatically build a key-information index for the information, laying a solid foundation for big-data applications of the information.

Description

A method and system for collecting and analyzing network information
Technical field
The present invention relates to the field of Internet data crawler technology, and in particular to an Internet-based method and system for collecting and analyzing network information.
Background
The way the public obtains financial information has gradually shifted from newspapers, radio, and television to the Internet as the main channel. As network information propagates across the Internet through different sources, it gains lasting influence and reach.
At present, most network information is maintained and managed manually, which leaves many shortcomings in the timeliness of the information and in how it is used. Collecting and organizing financial information manually therefore consumes a great deal of time and effort, and the results fall short of expectations.
To address these difficulties in the current Internet environment, Internet-based technical means are needed to detect the propagation time and range of network information quickly, and to improve the ability to guide information updates on the Internet and to mine content.
Summary of the invention
The problem to be solved by the invention is to provide a method and system for collecting and analyzing network information that can supply mathematical and logical support for the deep mining and application of network information.
To solve the above technical problem, the invention provides the following technical solutions:
A method for collecting and analyzing network information, comprising the following steps:
S1: assigning crawl tasks to preset crawl nodes, each crawl task corresponding to the network address of at least one target network;
S2: receiving the crawled data information sent by the crawl nodes, extracting target data from the data information, and storing the target data in a target database, the target data including the title, source, release time, and body content of the data information;
S3: determining the degree of repetition between the current target data and the other target data in the target database.
Preferably, step S1 is further configured to assign the crawl tasks according to the state of the web crawlers distributed across the crawl nodes.
Preferably, step S1 further includes:
S10: determining the number of seeds of the target websites corresponding to each crawl node;
S11: determining the number of seeds that the web crawler of each crawl node has finished crawling and the number of seeds it has not yet finished crawling;
S12: sorting the web crawlers of the crawl nodes by number of completed seeds, from high to low;
S13: assigning the seeds whose crawl tasks are unfinished to the crawl nodes according to the order obtained in S12.
Preferably, step S3 further includes:
S30: building an index for the target data and storing the index in an index database;
S31: comparing the target data corresponding to each index in the index database, determining the degree of repetition of each item of target data, and writing the degree of repetition to the corresponding entry in the target database.
Preferably, a correspondence between time and each item of target data that has repeated data is established according to the degree of repetition of the target data.
Preferably, the index includes keywords and key terms of the target data.
Preferably, the target database is associated with the index information contained in the index database.
Preferably, the target network is a network related to financial information.
The invention also provides a system for collecting and analyzing network information, which applies the method described above and includes:
a task allocation module, which assigns crawl tasks to preset crawl nodes, each crawl task corresponding to the network address of at least one target network;
a crawl module, which receives and executes the crawl tasks;
an extraction module, which receives the data information crawled by each crawl node in the crawl module, extracts target data from the data information, and stores the target data in a target database, the target data including the title, source, release time, and body content of the data information;
an analysis module, which determines, from the data information extracted by the extraction module, the degree of repetition between the current target data and the other target data in the target database.
Preferably, the analysis module further includes:
an index building unit, which builds an index for the target data and stores the index in an index database;
a repetition determination unit, which determines the degree of repetition of each item of target data based on the target data corresponding to the built indexes, and writes the degree of repetition to the corresponding entry in the target database.
The beneficial effects of the invention are as follows: the invention collects information by automated, intelligent technical means, which largely overcomes the bottleneck of manual information collection and processing; programs also build key-information indexes for the information automatically, laying a solid foundation for big-data applications of the information.
Brief description of the drawings
Fig. 1 is a flow chart of the method for collecting and analyzing network information in an embodiment of the invention;
Fig. 2 is a block diagram of the system for collecting and analyzing network information in an embodiment of the invention.
Description of reference numerals
1 - task allocation module; 2 - crawl module
3 - extraction module; 4 - analysis module
Detailed description
Embodiments of the invention are described in more detail below with reference to the accompanying drawings, but they do not limit the invention.
The invention provides a method and system for collecting and analyzing network information. The method can automatically analyze the data crawled by the crawl nodes in the network and build related indexes, and it can analyze the relationship between time and the degree of repetition of the data corresponding to those indexes, providing strong background support for data mining.
Fig. 1 shows a flow chart of a method for collecting and analyzing network information in an embodiment of the invention, which includes the following steps:
S1: through a tool configuration platform, assigning crawl tasks to preset crawl nodes, each crawl task corresponding to the network address of at least one target network; the network address may be the address of a website carrying financial information.
S2: receiving the crawled data information sent by each crawl node, extracting target data from that data information, and storing the target data in a target database, the target data including the title, source, release time, and body content of the data information; the data information may also include the seeds of the websites already crawled and the seed information of websites whose crawl tasks are unfinished.
S3: determining the degree of repetition between the current target data and the other target data in the target database. The degree of repetition may cover repetition of the title, repetition of the web page content, or repetition of the source, and it may be obtained by a combined calculation over these kinds of repetition, so that it reflects all of the repeated content.
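The combined calculation mentioned in S3 is not specified further in the text. The following is a minimal sketch of one possible way to fold title, body, and source repetition into a single degree of repetition (the "multiplicity" of S3). The weights, the use of `difflib.SequenceMatcher` as the similarity measure, and the dictionary record layout are all assumptions made for illustration, not details taken from the patent.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; empty text counts as no evidence."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def repetition_degree(current: dict, other: dict,
                      weights: tuple = (0.4, 0.5, 0.1)) -> float:
    """Combine title, body and source repetition into one score.

    The weights (title, body, source) are hypothetical defaults.
    """
    w_title, w_body, w_source = weights
    title_rep = similarity(current["title"], other["title"])
    body_rep = similarity(current["body"], other["body"])
    # Source repetition is treated as a simple exact match (an assumption).
    source_rep = 1.0 if current["source"] == other["source"] else 0.0
    total = w_title + w_body + w_source
    return (w_title * title_rep + w_body * body_rep + w_source * source_rep) / total
```

A score near 1 indicates that the two records are close to identical in all three respects, while a score near 0 indicates unrelated records. Other similarity measures could be swapped in without changing the overall shape.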
With this configuration, embodiments of the invention can measure how the information published on the related websites correlates, summarize the repeated content from it, and thereby analyze how topical or popular the content is.
In addition, step S1 may be further configured to assign the crawl tasks according to the state of the web crawlers distributed across the crawl nodes. That is, crawl tasks can be assigned according to the task completion status or idle status of the web crawler on each network node, so as to balance the workload of the network nodes. Specifically, step S1 in this embodiment may further include:
S10: determining the number of seeds of the target websites corresponding to each crawl node, that is, determining the total task load of the web crawler of each crawl node;
S11: determining the number of seeds that the web crawler of each crawl node has finished crawling and the number of seeds it has not yet finished crawling;
S12: sorting the web crawlers of the crawl nodes by number of completed seeds, from high to low;
S13: assigning the seeds whose crawl tasks are unfinished to the crawl nodes according to the order obtained in S12.
With this configuration, the crawl-task completion of each web crawler on the network nodes can be calculated automatically, and tasks can be reassigned according to the ranking of that completion. This can improve crawl efficiency and the cooperation among the network nodes, so that crawl tasks are completed quickly and effectively.
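Steps S10 to S13 can be sketched roughly as follows. The text only says that unfinished seeds are assigned "according to the order in S12"; dealing them out round-robin in ranking order is one possible reading and is an assumption here, as are the `CrawlNode` class and its field names.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlNode:
    """Hypothetical bookkeeping record for one crawl node."""
    name: str
    total_seeds: int                 # S10: seeds of the node's target websites
    completed: int = 0               # S11: seeds already crawled
    assigned: list = field(default_factory=list)

    @property
    def unfinished(self) -> int:
        """S11: seeds not yet crawled."""
        return self.total_seeds - self.completed


def assign_unfinished_seeds(nodes: list, pending_seeds: list) -> list:
    """Rank nodes by completed seeds (S12) and hand out pending seeds (S13)."""
    if not nodes:
        raise ValueError("at least one crawl node is required")
    # S12: highest number of completed seeds first.
    ranked = sorted(nodes, key=lambda n: n.completed, reverse=True)
    # S13: assumed policy, round-robin over the ranked nodes.
    for i, seed in enumerate(pending_seeds):
        ranked[i % len(ranked)].assigned.append(seed)
    return ranked
```

Under this policy the nodes that have completed the most work receive pending seeds first, which is one way to realize the load balancing the paragraph describes.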
In addition, step S3 in this embodiment may further include:
S30: building an index for the target data and storing the index in an index database;
S31: comparing the target data corresponding to each index in the index database, determining the degree of repetition of each item of target data, and writing the degree of repetition to the corresponding entry in the target database.
That is, building indexes makes it faster and more efficient to find repeated content or the degree of repetition of key content, and it also makes the data information easier to retrieve and read.
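One way to picture S30 and S31 is a keyword inverted index: only documents that share at least one keyword need to be compared, which is where the speed-up comes from. The sketch below is illustrative only. The tokenizer, the in-memory dictionaries standing in for the index database, and the use of Jaccard overlap as the degree of repetition are assumptions, not the patent's stated technique.

```python
import re
from collections import defaultdict


def extract_keywords(text: str) -> set:
    """Naive keyword extraction: lowercase alphanumeric tokens longer than 2 chars."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 2}


class KeywordIndex:
    """In-memory stand-in for the index database (hypothetical)."""

    def __init__(self):
        self.postings = defaultdict(set)   # keyword -> ids of documents containing it
        self.keywords = {}                 # document id -> its keyword set

    def add(self, doc_id, text: str) -> None:
        """S30: build the index entries for one item of target data."""
        kws = extract_keywords(text)
        self.keywords[doc_id] = kws
        for kw in kws:
            self.postings[kw].add(doc_id)

    def repetition_degrees(self, doc_id) -> dict:
        """S31: compare only against documents sharing at least one keyword."""
        kws = self.keywords[doc_id]
        candidates = set()
        for kw in kws:
            candidates |= self.postings[kw]
        candidates.discard(doc_id)
        return {
            other: len(kws & self.keywords[other]) / len(kws | self.keywords[other])
            for other in candidates
        }
```

The returned mapping could then be written back to the corresponding rows of the target database, as S31 describes.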
In this embodiment, a correspondence between time and each item of target data that has repeated data is established according to the degree of repetition of the target data. That is, a timeline relationship can be built between the items of data information or target data that share repeated information and their respective release times, and this relationship can be stored in the target database. The index in this embodiment may include keywords and key terms of the target data, and the target database is associated with the index information contained in the index database. Through the association between the target database and the index database, the relevant data information can be located quickly, so that information can be read, looked up, and compared quickly.
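The timeline relationship between repeated items and their release times can be illustrated with a small sketch. It assumes that release times are ISO 8601 strings, that pairwise degrees of repetition are already available, and that items count as repeats of each other above a hypothetical threshold of 0.8; none of these details come from the text.

```python
from datetime import datetime


def duplicate_timelines(release_times: dict, repetition: dict,
                        threshold: float = 0.8) -> list:
    """Group items linked by high repetition and order each group by release time.

    release_times: item id -> ISO 8601 release time (assumed format)
    repetition:    (id_a, id_b) -> degree of repetition in [0, 1]
    """
    parent = {i: i for i in release_times}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge every pair whose degree of repetition reaches the threshold.
    for (a, b), degree in repetition.items():
        if degree >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for i in release_times:
        groups.setdefault(find(i), []).append(i)

    # Keep only groups with actual repeats, earliest release first.
    return [
        sorted(members, key=lambda i: datetime.fromisoformat(release_times[i]))
        for members in groups.values()
        if len(members) > 1
    ]
```

Each returned list reads as a propagation timeline: the first element is the earliest publication of a story and the later elements are its subsequent repeats.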
The invention also provides a system for collecting and analyzing network information, which applies the method described in the above embodiment. As shown in Fig. 2, the system in this embodiment may include a task allocation module 1, a crawl module 2, an extraction module 3, and an analysis module 4. The task allocation module 1 can assign crawl tasks to preset crawl nodes, each crawl task corresponding to the network address of at least one target network. The crawl module 2 can receive and execute the crawl tasks, and it includes the web crawlers deployed on the network nodes. The extraction module 3 can receive the data information crawled by each crawl node in the crawl module 2, extract target data from that data information, and store the target data in the target database, the target data including the title, source, release time, and body content of the data information. The analysis module 4 can determine, from the data information extracted by the extraction module 3, the degree of repetition between the current target data and the other target data in the target database.
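As an illustration of what the extraction module does with a crawled record, the sketch below picks out the four target-data fields and stores them. The raw record layout, the table name `target_data`, and the use of SQLite as the target database are assumptions chosen to keep the example self-contained.

```python
import sqlite3
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class TargetData:
    """The four fields the text names for target data."""
    title: str
    source: str
    release_time: str   # stored as text; an ISO 8601 string is assumed
    body: str


def extract_target_data(raw: dict) -> TargetData:
    """Pick the four fields out of a raw crawl record (hypothetical layout)."""
    return TargetData(
        title=raw.get("title", "").strip(),
        source=raw.get("source", "").strip(),
        release_time=raw.get("release_time", "").strip(),
        body=raw.get("body", "").strip(),
    )


def open_target_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the target database and create the table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS target_data ("
        "id INTEGER PRIMARY KEY, title TEXT, source TEXT, "
        "release_time TEXT, body TEXT)"
    )
    return conn


def store_target_data(conn: sqlite3.Connection, item: TargetData) -> int:
    """Insert one item and return its row id."""
    cur = conn.execute(
        "INSERT INTO target_data (title, source, release_time, body) "
        "VALUES (?, ?, ?, ?)",
        astuple(item),
    )
    conn.commit()
    return cur.lastrowid
```

The row id returned by `store_target_data` could serve as the key that ties a record in the target database to its entries in the index database.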
With this configuration, the system of this embodiment can measure how the information published on the related websites correlates, summarize the repeated content from it, and thereby analyze how topical or popular the content is.
In addition, this embodiment may also include a calculation module and a sorting module. The calculation module calculates the number of seeds of the target websites corresponding to each crawl node, that is, it determines the total task load of the web crawler of each crawl node. The sorting module determines the number of seeds that the web crawlers of the crawl nodes have finished crawling and the number not yet finished. The task allocation module then assigns the seeds whose crawl tasks are unfinished to the crawl nodes according to the order produced by the sorting module.
With this configuration, the crawl-task completion of each web crawler on the network nodes can be calculated automatically, and tasks can be reassigned according to the ranking of that completion. This can improve crawl efficiency and the cooperation among the network nodes, so that crawl tasks are completed quickly and effectively.
In addition, the analysis module 4 in this embodiment may further include an index building unit 41 and a repetition determination unit 42. The index building unit 41 can build an index for the target data and store the index in an index database;
The repetition determination unit 42 can determine the degree of repetition of each item of target data based on the target data corresponding to the built indexes, and write the degree of repetition to the corresponding entry in the target database. That is, building indexes makes it faster and more efficient to find repeated content or the degree of repetition of key content, and it also makes the data information easier to retrieve and read.
The above embodiments are merely exemplary embodiments of the invention and do not limit it; the scope of protection of the invention is defined by the claims. Those skilled in the art may make various modifications or equivalent substitutions to the invention within its spirit and scope of protection, and such modifications or equivalent substitutions shall also be regarded as falling within the scope of protection of the invention.

Claims (10)

1. A method for collecting and analyzing network information, characterized by comprising the following steps:
S1: assigning crawl tasks to preset crawl nodes, each crawl task corresponding to the network address of at least one target network;
S2: receiving the crawled data information sent by the crawl nodes, extracting target data from the data information, and storing the target data in a target database, the target data including the title, source, release time, and body content of the data information;
S3: determining the degree of repetition between the current target data and the other target data in the target database.
2. The method according to claim 1, characterized in that step S1 is further configured to assign the crawl tasks according to the state of the web crawlers distributed across the crawl nodes.
3. The method according to claim 2, characterized in that step S1 further includes:
S10: determining the number of seeds of the target websites corresponding to each crawl node;
S11: determining the number of seeds that the web crawler of each crawl node has finished crawling and the number of seeds it has not yet finished crawling;
S12: sorting the web crawlers of the crawl nodes by number of completed seeds, from high to low;
S13: assigning the seeds whose crawl tasks are unfinished to the crawl nodes according to the order obtained in S12.
4. The method according to claim 1, characterized in that step S3 further includes:
S30: building an index for the target data and storing the index in an index database;
S31: comparing the target data corresponding to each index in the index database, determining the degree of repetition of each item of target data, and writing the degree of repetition to the corresponding entry in the target database.
5. The method according to claim 4, characterized in that a correspondence between time and each item of target data that has repeated data is established according to the degree of repetition of the target data.
6. The method according to claim 4, characterized in that the index includes keywords and key terms of the target data.
7. The method according to claim 4, characterized in that the target database is associated with the index information contained in the index database.
8. The method according to claim 1, characterized in that the target network is a network related to financial information.
9. A system for collecting and analyzing network information, which applies the method for collecting and analyzing network information according to any one of claims 1-8, the system including:
a task allocation module, which assigns crawl tasks to preset crawl nodes, each crawl task corresponding to the network address of at least one target network;
a crawl module, which receives and executes the crawl tasks;
an extraction module, which receives the data information crawled by each crawl node in the crawl module, extracts target data from the data information, and stores the target data in a target database, the target data including the title, source, release time, and body content of the data information;
an analysis module, which determines, from the data information extracted by the extraction module, the degree of repetition between the current target data and the other target data in the target database.
10. The system according to claim 9, characterized in that the analysis module further includes:
an index building unit, which builds an index for the target data and stores the index in an index database;
a repetition determination unit, which determines the degree of repetition of each item of target data based on the target data corresponding to the built indexes, and writes the degree of repetition to the corresponding entry in the target database.
CN201610277727.3A 2016-04-28 2016-04-28 Network information collection and analysis method and network information collection and analysis system Pending CN105956069A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610277727.3A CN105956069A (en) 2016-04-28 2016-04-28 Network information collection and analysis method and network information collection and analysis system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610277727.3A CN105956069A (en) 2016-04-28 2016-04-28 Network information collection and analysis method and network information collection and analysis system

Publications (1)

Publication Number Publication Date
CN105956069A true CN105956069A (en) 2016-09-21

Family

ID=56916814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610277727.3A Pending CN105956069A (en) 2016-04-28 2016-04-28 Network information collection and analysis method and network information collection and analysis system

Country Status (1)

Country Link
CN (1) CN105956069A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273534A (en) * 2017-06-29 2017-10-20 武汉楚鼎信息技术有限公司 A kind of data processing method extracted based on information content, system
CN107729449A (en) * 2017-10-09 2018-02-23 广州市万表科技股份有限公司 A kind of Web content crawl methods of exhibiting and platform
WO2019079992A1 (en) * 2017-10-25 2019-05-02 麦格创科技(深圳)有限公司 Task manager allocation method in distributed crawler system, and system
CN110309403A (en) * 2018-03-05 2019-10-08 百度在线网络技术(北京)有限公司 Method and apparatus for grabbing data

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040128285A1 (en) * 2000-12-15 2004-07-01 Jacob Green Dynamic-content web crawling through traffic monitoring
CN103488709A (en) * 2013-09-09 2014-01-01 东软集团股份有限公司 Method and system for building indexes and method and system for retrieving indexes
CN103544255A (en) * 2013-10-15 2014-01-29 常州大学 Text semantic relativity based network public opinion information analysis method
CN103559219A (en) * 2013-10-18 2014-02-05 北京京东尚科信息技术有限公司 Distributed web crawler capture task dispatching method, dispatching-side device and capture nodes
CN104516956A (en) * 2014-12-16 2015-04-15 中国科学院声学研究所 Incremental crawling method for website information
CN104951512A (en) * 2015-05-27 2015-09-30 中国科学院信息工程研究所 Public sentiment data collection method and system based on Internet

Similar Documents

Publication Publication Date Title
Coscia et al. Demon: a local-first discovery method for overlapping communities
Ediger et al. Tracking structure of streaming social networks
CN104735138A (en) Distributed acquisition method and system oriented to user generated content
CN103778148B (en) Life cycle management method and equipment for data file of Hadoop distributed file system
CN106709012A (en) Method and device for analyzing big data
CN105956069A (en) Network information collection and analysis method and network information collection and analysis system
CN104951512A (en) Public sentiment data collection method and system based on Internet
CN112165462A (en) Attack prediction method and device based on portrait, electronic equipment and storage medium
CN105959372A (en) Internet user data analysis method based on mobile application
CN105468744A (en) Big data platform for realizing tax public opinion analysis and full text retrieval
CN103678609A (en) Large data inquiring method based on distribution relation-object mapping processing
CN104462222A (en) Distributed storage method and system for checkpoint vehicle pass data
CN103631922A (en) Hadoop cluster-based large-scale Web information extraction method and system
CN102968494A (en) System and method for acquiring traffic information by microblog
CN102521706A (en) KPI data analysis method and device for the same
CN105404644A (en) Public opinion information processing method and system
CN103853838A (en) Data processing method and device
Saran et al. A comprehensive review on biodiversity information portals
Acosta et al. City safety perception model based on visual content of street images
Yang et al. A learning-to-rank algorithm for constructing defect prediction models
Inoue et al. Analysis of cooperative research and development networks on Japanese patents
CN107832451A (en) A kind of big data cleaning way of simplification
CN102902739B (en) Towards the workflow view building method in uncertain data source under cloud computing environment
CN204731786U (en) Adopt the large data analysis system of computing machine verification code technology
Lee et al. A git source repository analysis tool based on a novel branch-oriented approach

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160921
