CN106844782A

CN106844782A - A network-oriented multichannel big data acquisition system and method

Info

Publication number: CN106844782A
Application number: CN201710142262.5A
Authority: CN
Inventors: 朱世伟; 杨子江; 于俊凤; 李源; 冯海洲; 魏墨济; 王燕; 李思思; 张铭君; 王彦
Original assignee: INFORMATION RESEARCH INSTITUTE OF SHANDONG ACADEMY OF SCIENCES
Current assignee: China Southern Power Grid Internet Service Co ltd; Jingchuang United Beijing Intellectual Property Service Co ltd
Priority date: 2017-03-10
Filing date: 2017-03-10
Publication date: 2017-06-13
Anticipated expiration: 2037-03-10
Also published as: CN106844782B

Abstract

The invention discloses a network-oriented multichannel big data acquisition system and method. The system is built on a distributed oriented acquisition architecture composed of a forum data acquisition unit, a blog data acquisition unit, a news data acquisition unit, and a relational-database data acquisition unit. The forum data acquisition unit acquires network data in online forums and offline forums using a dynamic web page acquisition method and a web page information extraction method, respectively. The blog data acquisition unit traverses blog sites breadth-first in order to obtain blog Feed addresses, collects the blog corresponding to each Feed address in real time, tracks updated blog articles, and gathers blog information incrementally. The news data acquisition unit extracts the body text of news web pages using a method based on the row block distribution function. The relational-database data acquisition unit uses a data transfer tool to collect data from relational databases in batches.

Description

A network-oriented multichannel big data acquisition system and method

Technical field

The invention belongs to the field of network data processing, and in particular relates to a network-oriented multichannel big data acquisition system and method.

Background technology

"Big data" has become a strategic resource as important as natural resources and human resources, and the enormous social and economic value it implies has drawn close attention from the scientific community and from industry. Organizing and using this data effectively would give a strong impetus to social and economic development. Most of this rapidly growing data comes from people's daily lives; in particular, the Internet has become China's largest public information distribution center and social platform. Compared with traditional media such as newspapers, radio, and television, online media have a low barrier to entry, a very large volume of information, rapid publication and dissemination, a huge participating population, and strong real-time interactivity, and they have become the fastest and broadest information channel across social, political, and economic fields. How to discover useful information in a timely manner from the large volume of Internet data has therefore become a focus of attention for governments and industries.

Network data resources are large in scale, come from different websites around the world, and are widely dispersed. Faced with massive network information in diverse forms, how to accurately identify and extract information of different sources and formats, gather it efficiently and comprehensively, and track its updates in a timely manner is the main difficulty of big data acquisition and the basis for the accuracy of later big data analysis.

Summary of the invention

To overcome the deficiencies of the prior art, the first object of the invention is to provide a network-oriented multichannel big data acquisition system.

In the network-oriented multichannel big data acquisition system of the invention, the system is composed of a distributed oriented acquisition architecture formed by a forum data acquisition unit, a blog data acquisition unit, a news data acquisition unit, and a relational-database data acquisition unit;

The forum data acquisition unit is configured to acquire network data in online forums and offline forums using a dynamic web page acquisition method and a web page information extraction method, respectively;

The blog data acquisition unit is configured to traverse blog sites breadth-first in order to obtain blog Feed addresses, to collect the blog corresponding to each Feed address in real time, to track updated blog articles, and to gather blog information incrementally;

The news data acquisition unit is configured to extract the body text of news web pages using a method based on the row block distribution function, thereby obtaining news data;

The relational-database data acquisition unit is configured to use a data transfer tool to collect data from relational databases in batches.

Further, in the forum data acquisition unit, network data in a forum is collected with the column (forum board) as the basic unit. Through four stages, namely column page acquisition, column page information extraction, post page acquisition, and post page information extraction, web page acquisition and web page information extraction are combined to obtain the network data in the forum.

By combining efficient dynamic web page acquisition with web page information extraction, the invention acquires, in real time, comprehensively, and accurately, the posts and their associated meta-information in specified columns of specified forum websites.

Further, the blog data acquisition unit consists of one Feed finder and multiple information collectors. The Feed finder obtains the URL addresses or RSS addresses of blogs and follows the link relationships of each blog page to obtain the URL addresses or RSS addresses of other blogs. The collectors perform incremental refresh collection on blogs, extract newly published blog post information, generate the corresponding blog post records, and store them.

The invention can collect updated blog data in real time, so that data acquisition is both timely and accurate.

Further, the news data acquisition unit includes a web page HTML source preprocessing module, which handles the encoding of the web page HTML source and removes scripts and special characters; and

a format tag removal module, which removes format tags from the preprocessed web page HTML source to obtain rough page text; and

a body text extraction module, which uses a preset distribution function of per-row character counts to extract the desired body text from the rough page text, thereby obtaining news data.

The invention can obtain news data intuitively, efficiently, and accurately.

Further, in the relational-database data acquisition unit, the data transfer tool is Sqoop.

Sqoop is a tool for transferring data between Hadoop and relational databases. It can import data from a relational database (for example MySQL, Oracle, Postgres, or any relational database that supports the JDBC specification) into Hadoop's HDFS. It also provides connectors for some NoSQL databases. Like other ETL tools, Sqoop uses metadata to infer data types and ensures type-safe data handling when data is transferred from the data source to Hadoop. Sqoop is designed for bulk transfer of big data; it can partition a data set and create Hadoop tasks to process each partition.
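As a rough illustration of the batch import described above, the following sketch assembles a `sqoop import` command line in Python. The JDBC URL, user name, password file, table, and HDFS directory are hypothetical placeholders that do not come from the patent; the flags shown are standard Sqoop 1 import arguments.

```python
from typing import List


def build_sqoop_import_args(jdbc_url: str, username: str, password_file: str,
                            table: str, target_dir: str,
                            num_mappers: int = 4) -> List[str]:
    """Assemble the argument list for one `sqoop import` run."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC URL of the source database
        "--username", username,
        "--password-file", password_file,  # file holding the password
        "--table", table,                  # table to import
        "--target-dir", target_dir,        # HDFS directory for the output
        "--num-mappers", str(num_mappers), # parallel map tasks (data partitions)
    ]


# Hypothetical connection details, for illustration only.
args = build_sqoop_import_args(
    "jdbc:mysql://db.example.com/news", "collector",
    "/user/collector/.db_password", "articles", "/data/raw/articles")
print(" ".join(args))
# On a host with Sqoop installed, the command could be launched with
# subprocess.run(args, check=True).
```

Building the argument list separately from running it keeps the sketch testable on a machine without a Hadoop installation.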

The second object of the invention is to provide a network-oriented multichannel big data acquisition method.

In the network-oriented multichannel big data acquisition method of the invention, a distributed oriented acquisition architecture is used to collect forum data, blog data, news data, and data in relational databases simultaneously in a distributed manner;

wherein network data in online forums and offline forums is acquired using a dynamic web page acquisition method and a web page information extraction method, respectively;

in collecting blog data, blog Feed addresses are first obtained; the blog corresponding to each Feed address is then collected in real time, updated blog articles are tracked, and blog information is gathered incrementally;

the body text of news web pages is extracted using a method based on the row block distribution function, thereby obtaining news data;

a data transfer tool is used to collect data from relational databases in batches.

Further, in collecting forum data, network data in a forum is collected with the column as the basic unit. Through four stages, namely column page acquisition, column page information extraction, post page acquisition, and post page information extraction, web page acquisition and web page information extraction are combined to obtain the network data in the forum.

Further, in collecting forum data, a Feed finder is used to obtain the URL addresses or RSS addresses of blogs and to follow the link relationships of each blog page to obtain the URL addresses or RSS addresses of other blogs; collectors are used to perform incremental refresh collection on blogs, extract newly published blog post information, generate the corresponding blog post records, and store them.

Further, the detailed process of collecting news data includes:

handling the encoding of the web page HTML source and removing scripts and special characters;

removing format tags from the preprocessed web page HTML source to obtain rough page text;

extracting the desired body text from the rough page text using a preset distribution function of per-row character counts, thereby obtaining news data.

Further, the data transfer tool is Sqoop.

Compared with the prior art, the beneficial effects of the invention are:

(1) Faced with massive network information in diverse forms, the invention uses a distributed oriented acquisition architecture to collect forum data, blog data, news data, and data in relational databases simultaneously in a distributed manner. It accurately identifies and extracts information of different sources and formats, gathers information efficiently and comprehensively, tracks information updates in a timely manner, and reduces the manual workload of maintaining big data.

(2) The invention ensures, to the greatest extent possible, the efficiency, comprehensiveness, and timeliness of the network information acquisition process, and provides the upper-layer analysis and processing modules with a comprehensive, stable, and secure information source.

Brief description of the drawings

The accompanying drawings, which form a part of this application, are provided for further understanding of the application. The illustrative embodiments of the application and their descriptions are used to explain the application and do not unduly limit it.

Fig. 1 is a schematic structural diagram of the network-oriented multichannel big data acquisition system of the invention.

Fig. 2 is a diagram of the forum information acquisition process.

Fig. 3 is a flow chart of data acquisition from forum column pages.

Fig. 4 is the system architecture diagram of the blog data acquisition unit.

Fig. 5 is the functional diagram of the blog data acquisition unit.

Fig. 6 is the body text extraction framework based on the row block distribution function method.

Fig. 7 is the architecture diagram of HDFS.

Specific embodiment

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meanings as commonly understood by a person of ordinary skill in the art to which this application belongs.

It should also be noted that the terminology used here is only for describing specific embodiments and is not intended to limit the exemplary embodiments of the application. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well. It should further be understood that the terms "comprising" and/or "including", when used in this specification, indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.

As shown in Fig. 1, the network-oriented multichannel big data acquisition system of the invention is composed of a distributed oriented acquisition architecture formed by a forum data acquisition unit, a blog data acquisition unit, a news data acquisition unit, and a relational-database data acquisition unit;

Given the characteristics of different types of networks, the invention adopts oriented acquisition, taking target sites in different networks as the basic task units of information gathering. Each acquisition task can use its own collection rules and strategy (for example depth, collection refresh frequency, and information extraction template). To meet the scale and flexibility requirements of network data acquisition, a distributed oriented acquisition architecture of "master-slave distribution, autonomous collaboration" is used.
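The per-task rules and the master-slave distribution can be pictured with a small data model. The sketch below is only an assumption about how such tasks might be represented: the field names, the round-robin assignment, and the site and worker names are all hypothetical, and the patent does not specify a scheduling policy.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Dict, List


@dataclass(frozen=True)
class AcquisitionTask:
    """One target site with its own collection rules (hypothetical fields)."""
    site: str
    channel: str          # "forum", "blog", "news" or "rdb"
    max_depth: int        # crawl depth limit
    refresh_minutes: int  # collection refresh frequency
    template: str         # name of the information extraction template


def assign_tasks(tasks: List[AcquisitionTask],
                 workers: List[str]) -> Dict[str, List[AcquisitionTask]]:
    """Master side: hand tasks to workers in round-robin order."""
    plan: Dict[str, List[AcquisitionTask]] = {w: [] for w in workers}
    for task, worker in zip(tasks, cycle(workers)):
        plan[worker].append(task)
    return plan


tasks = [
    AcquisitionTask("forum.example.com", "forum", 2, 10, "forum_board_v1"),
    AcquisitionTask("blog.example.com", "blog", 1, 30, "rss_default"),
    AcquisitionTask("news.example.com", "news", 3, 5, "row_block"),
]
plan = assign_tasks(tasks, ["worker-1", "worker-2"])
for worker, assigned in plan.items():
    print(worker, [t.site for t in assigned])
```

Each worker would then run its tasks independently with the rules carried by the task object, which is one way to read "autonomous collaboration".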

Faced with massive network information in diverse forms, the system must accurately identify and extract information of different sources and formats, gather it efficiently and comprehensively, track updates in a timely manner, and keep the maintenance workload as low as possible. To this end, the invention uses semi-automatic generation of vertical search templates, optimized access to dynamic pages, and an intelligent crawl scheduling strategy to ensure, to the greatest extent possible, the efficiency, comprehensiveness, and timeliness of the network information acquisition process, providing the upper-layer analysis and processing modules with a comprehensive, stable, and secure information source.

Forum information is acquired with the column as the basic unit of the information source. Given a column, acquiring its information involves four main stages (which may run in parallel in practice): column page acquisition → column page extraction → post page acquisition → post page extraction, as shown in Fig. 2.

Collecting from the column entry point makes it possible to go directly to the columns that need to be collected, which matches the requirement of oriented data acquisition exactly. Through the four stages of column page acquisition, column page information extraction, post page acquisition, and post page information extraction, web page acquisition and web page information extraction are closely combined, which effectively addresses problems found in conventional information acquisition techniques.
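The four stages can be sketched as one small pipeline. This is a minimal illustration under stated assumptions rather than the patent's implementation: the in-memory `FAKE_SITE` stands in for HTTP fetching, and the HTML shapes and regular expressions are hypothetical.

```python
import re
from typing import Callable, Dict, List

# Hypothetical forum content keyed by URL; a real collector would fetch over HTTP.
FAKE_SITE: Dict[str, str] = {
    "/board/1": '<a class="post" href="/post/10">First topic</a>'
                '<a class="post" href="/post/11">Second topic</a>',
    "/post/10": "<div id='body'>Hello forum</div>",
    "/post/11": "<div id='body'>Another post</div>",
}

POST_LINK = re.compile(r'<a class="post" href="([^"]+)">([^<]*)</a>')
POST_BODY = re.compile(r"<div id='body'>(.*?)</div>", re.S)


def collect_column(column_url: str,
                   fetch: Callable[[str], str]) -> List[Dict[str, str]]:
    column_html = fetch(column_url)              # stage 1: column page acquisition
    entries = POST_LINK.findall(column_html)     # stage 2: column page extraction
    records = []
    for post_url, title in entries:
        post_html = fetch(post_url)              # stage 3: post page acquisition
        match = POST_BODY.search(post_html)      # stage 4: post page extraction
        records.append({"url": post_url, "title": title,
                        "body": match.group(1) if match else ""})
    return records


records = collect_column("/board/1", FAKE_SITE.__getitem__)
for record in records:
    print(record)
```

Passing `fetch` as a parameter keeps acquisition and extraction separable, so the stages could be run by different workers or in parallel, as the text notes.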

A forum's column page contains an index list of posts, and the list holds rich post meta-information. Each row of the list records one set of meta-information for a post, including its subject, author, posting time, view count, reply count, and so on. This meta-information is very important for forum data analysis. The structure of a column page is usually fairly regular, so post meta-information can be extracted effectively from it. The method has two parts: 1. extracting metadata from the column page (it is called metadata rather than meta-information because the meaning of these values, such as title or author, is not yet known); 2. metadata integration and storage: the meaning of each metadata item is identified (referred to as metadata parsing), turning the metadata into real meta-information, which is then stored. The flow of the whole method is shown in Fig. 3.

Metadata extraction has an offline part and an online part. Offline, the user provides a column page as a sample page, and an unsupervised learning method generates a template for column pages similar to the training sample. Online, metadata is extracted from new column pages according to the template. Extraction operates on the DOM. The extraction process makes full use of the correspondence between post records (and the attributes within a record) on the column page and nodes in the DOM tree, as well as the structural characteristics of those nodes. This extraction method offers high extraction efficiency, accurate positioning, and relatively low maintenance cost.
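The online step, applying a template to a new column page, might look like the sketch below. It assumes that each post record is a table row and that the template is simply an ordered list of field names; both are simplifications introduced here, and the unsupervised template learning itself is not shown.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def apply_template(html: str, template: List[str]) -> List[Dict[str, str]]:
    """Map each row's cells onto the field names given by the template."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return [dict(zip(template, row))
            for row in parser.rows if len(row) == len(template)]


# Hypothetical column page and template, for illustration only.
COLUMN_PAGE = """
<table>
<tr><td><a href="/post/10">First topic</a></td><td>alice</td><td>2017-03-01</td><td>120</td><td>8</td></tr>
<tr><td><a href="/post/11">Second topic</a></td><td>bob</td><td>2017-03-02</td><td>45</td><td>2</td></tr>
</table>
"""
TEMPLATE = ["title", "author", "posted", "views", "replies"]
meta = apply_template(COLUMN_PAGE, TEMPLATE)
for row in meta:
    print(row)
```

Naming the fields corresponds to the metadata parsing step: before the template is applied the cells are anonymous metadata, and afterwards they are labelled meta-information ready for storage.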

In a specific implementation, the basic task units also include the blog data acquisition unit, which traverses blog sites breadth-first in order to obtain blog Feed addresses, collects the blog corresponding to each Feed address in real time, tracks updated blog articles, and gathers blog information incrementally.

With the system architecture shown in Fig. 4, the system uses a distributed design with one Feed finder and multiple information collectors. The goal of the Feed discovery module is to find as many RSS or Atom addresses as possible for the blogs hosted by each blog service provider (BSP). Analysis shows that the URL addresses or RSS addresses of each BSP's blogs follow a certain pattern; this pattern can be used to recognize whether a page is a blog page, and the link relationships of each blog page are then followed to find more blogs.
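A breadth-first Feed discovery of this kind can be sketched as follows. The URL pattern, the `/rss` suffix convention, and the in-memory link graph are hypothetical stand-ins for one BSP's real address scheme and for fetched blog pages.

```python
import re
from collections import deque
from typing import Callable, Dict, Iterable, List

# Hypothetical address pattern of one blog service provider (BSP).
BLOG_URL = re.compile(r"^https://blog\.example\.com/u/\w+$")

# Hypothetical outgoing links found on each blog page.
LINKS: Dict[str, List[str]] = {
    "https://blog.example.com/u/alice": ["https://blog.example.com/u/bob",
                                         "https://ads.example.net/x"],
    "https://blog.example.com/u/bob": ["https://blog.example.com/u/carol",
                                       "https://blog.example.com/u/alice"],
    "https://blog.example.com/u/carol": [],
}


def discover_feeds(seed: str,
                   get_links: Callable[[str], Iterable[str]]) -> List[str]:
    """Breadth-first traversal over blog pages, returning feed addresses."""
    seen = {seed}
    queue = deque([seed])
    feeds = []
    while queue:
        page = queue.popleft()
        feeds.append(page + "/rss")  # assumed feed address convention
        for link in get_links(page):
            # Only follow links that match the BSP's blog URL pattern.
            if BLOG_URL.match(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return feeds


feeds = discover_feeds("https://blog.example.com/u/alice",
                       lambda url: LINKS.get(url, []))
print(feeds)
```

The `seen` set keeps the traversal from revisiting blogs that link back to each other, which matters because blog link graphs are usually cyclic.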

The collector performs incremental refresh collection on blogs, extracts newly published blog post information, generates the corresponding blog post records, and stores them. Its functions are shown in Fig. 5. The invention can collect updated blog data in real time, so that data acquisition is both timely and accurate.
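Incremental refresh can be illustrated by remembering which feed entries have already been stored. The sketch below assumes RSS 2.0 feeds and identifies entries by `guid`, falling back to `link`; the feed contents are made up, and a real collector would persist the seen set instead of keeping it in memory.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Set


def new_entries(rss_xml: str, seen_ids: Set[str]) -> List[Dict[str, str]]:
    """Return only the feed items not seen in earlier polls."""
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        entry_id = item.findtext("guid") or item.findtext("link")
        if entry_id and entry_id not in seen_ids:
            seen_ids.add(entry_id)
            fresh.append({"id": entry_id,
                          "title": item.findtext("title", default="")})
    return fresh


# Two successive polls of the same hypothetical feed.
POLL_1 = ("<rss version='2.0'><channel>"
          "<item><title>Post A</title><guid>a-1</guid></item>"
          "</channel></rss>")
POLL_2 = ("<rss version='2.0'><channel>"
          "<item><title>Post B</title><guid>a-2</guid></item>"
          "<item><title>Post A</title><guid>a-1</guid></item>"
          "</channel></rss>")

seen: Set[str] = set()
first = new_entries(POLL_1, seen)
second = new_entries(POLL_2, seen)
print(first)
print(second)
```

Only the second poll's new item is returned, so each blog post produces exactly one stored record regardless of how often the feed is refreshed.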

In obtaining news data, the body text of news web pages is extracted using the method based on the row block distribution function, thereby obtaining news data.

The main task of body data extraction is to identify the information of interest to the user in the unstructured or semi-structured information on the Web and convert it into well-structured data with clear semantics. The input of the information extraction system is raw text, and the output is information in a fixed format. Finally, the extracted data is cleaned, organized, and stored in a relational database for subsequent precise queries and pattern extraction.

To capture the Chinese text of news web pages effectively, the body text of a page is extracted using the method based on the row block distribution function, yielding the core content of the document. The body text extraction framework based on the row block distribution function method is shown in Fig. 6.

In HTML, body text is always mixed with tags. Admittedly, the way tags decorate text plays an important role in determining term weights and ranking results. However, precisely because HTML tags and text are interleaved in complex and non-standard ways, general-purpose body text extraction is hard to achieve; one ends up defining different rules for different websites, and time and space complexity suffer greatly.

For this reason, the invention proposes a general method based on the row block distribution function that can extract the body text in linear time O(N). The method rests on two bases: 1. the density of the body text region; 2. the length of row blocks.

Basis 1: the body text region of a page is certainly one of the regions where text is most densely distributed. It may be the largest such region, but not necessarily: when comments are long, or when the news body is short and a large block of dense navigation text appears, the body text region is not the largest block.

Basis 2: row block length information effectively resolves this problem.

Combining basis 1 and basis 2 gives good body text extraction. Both are merged into the row block distribution function, as follows:

First, the tags are removed from the page HTML so that only the text remains, while all whitespace position information left after tag removal is preserved. The remaining text is called Ctext.

Definition 1. Row block:

Taking a line number of Ctext as the axis, take K lines around it (the context; K < 5, and K = 3 here, taken in the downward direction; K is called the row block thickness). Together these form a row block, Cblock. Row block i is the row block whose axis is line i of Ctext;

Definition 2. Row block length:

For a Cblock, the total number of characters after removing all whitespace characters in it (\n, \r, \t, etc.) is called the length of that row block;

Definition 3. Row block distribution function:

Taking each line of Ctext as an axis gives LinesNum(Ctext) - K Cblocks in total. The distribution function is plotted with [1, LinesNum(Ctext) - K] as the horizontal axis and the corresponding row block length as the vertical axis;

The row block distribution function can be computed in O(N) time, and the region containing the body text can be seen directly on its plot. On such a plot, the correct body text region is a continuous region that contains the maximum value, and it usually contains one surge point and one plunge point.

The body text extraction problem is thus converted into finding the two boundary points, the surge and the plunge, on the row block distribution function. The region bounded by these two points contains the maximum row block length of the current page and is continuous.

To find the starting row block X_start and the ending row block X_end of the body text region (X is the line number and Y(X) is the length of the row block whose axis is X), the following four conditions must be met:

(1) Y(X_start) > Y(X_t) (Y(X_t) is the first surge point; a surge point must exceed a certain threshold);

(2) Y(X_n) ≠ 0 (n ∈ [start+1, start+K], where K is the row block thickness; the row block lengths immediately following the surge point must not be 0, to avoid noise);

(3) Y(X_m) = 0 (m ∈ [end, end+1]; the row block lengths immediately following the plunge point are 0, ensuring that the body text has ended);

(4) there exists X such that, where max(Y(X)) is attained, X ∈ [X_start, X_end] (ensuring that this region contains the row block with the maximum length).
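The four conditions can be checked with a direct scan over the distribution. The sketch below is a straightforward rendering and makes several assumptions: the surge threshold is passed in as a plain number, indices are zero-based, positions past the end of the list are treated as length 0, and the region runs to the last block when no plunge point is found. It favours clarity over the linear-time bound.

```python
from typing import List, Optional, Tuple


def find_text_region(y: List[int], threshold: int,
                     k: int = 3) -> Optional[Tuple[int, int]]:
    """Return (start, end) indices of the body text region, or None."""
    if not y:
        return None
    peak = max(range(len(y)), key=y.__getitem__)
    for start in range(len(y)):
        # (1) the surge point must exceed the threshold
        if y[start] <= threshold:
            continue
        # (2) the k blocks right after the surge point must be non-empty
        follow = y[start + 1:start + 1 + k]
        if not follow or any(v == 0 for v in follow):
            continue
        # (3) the region ends where two consecutive block lengths are 0
        end = start
        while end < len(y) and not (
                y[end] == 0 and (end + 1 >= len(y) or y[end + 1] == 0)):
            end += 1
        end = min(end, len(y) - 1)
        # (4) the region must contain the longest row block
        if start <= peak <= end:
            return start, end
    return None


# Made-up distribution: navigation noise, a body region, then a short footer.
y = [4, 0, 0, 90, 120, 150, 80, 40, 0, 0, 6, 0]
region = find_text_region(y, threshold=50, k=3)
print(region)
```

Joining the Ctext lines covered by the returned block range would then give the extracted body text.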

The news data acquisition unit of the invention includes a web page HTML source preprocessing module, which handles the encoding of the web page HTML source and removes scripts and special characters; and

In a specific implementation, in the relational-database data acquisition unit, the data transfer tool is Sqoop.

The Hadoop framework consists of the distributed file system HDFS and MapReduce. HDFS is Hadoop's file system, used to store very large files; MapReduce is Hadoop's parallel programming model, used for in-depth analysis of the data stored on HDFS.

Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS was originally developed as infrastructure for the Apache Nutch search engine project.

HDFS mainly consists of the Client, Datanodes, and the Namenode; its architecture is shown in Fig. 7. In a cluster built on Hadoop, there are typically one or two hosts acting as Namenode and a number of hosts acting as Datanodes. The Client represents a client program that uses HDFS. The Namenode is a host in the Hadoop cluster responsible for tasks such as keeping information about the data nodes, distributing computing tasks, and the final reduction. Datanodes are responsible for storing and processing data. To protect the data, HDFS adds a moderate amount of redundancy: multiple copies of the same data, usually three, are kept on different Datanodes.

Faced with massive network information in diverse forms, the invention uses a distributed oriented acquisition architecture to collect forum data, blog data, news data, and data in relational databases simultaneously in a distributed manner. It accurately identifies and extracts information of different sources and formats, gathers information efficiently and comprehensively, tracks information updates in a timely manner, and reduces the manual workload of maintaining big data.

The invention ensures, to the greatest extent possible, the efficiency, comprehensiveness, and timeliness of the network information acquisition process, and provides the upper-layer analysis and processing modules with a comprehensive, stable, and secure information source.

A data transfer tool is used to collect data from relational databases in batches.

Specifically, in collecting forum data, network data in a forum is collected with the column as the basic unit. Through four stages, namely column page acquisition, column page information extraction, post page acquisition, and post page information extraction, web page acquisition and web page information extraction are combined to obtain the network data in the forum.

Specifically, in collecting forum data, a Feed finder is used to obtain the URL addresses or RSS addresses of blogs and to follow the link relationships of each blog page to obtain the URL addresses or RSS addresses of other blogs; collectors are used to perform incremental refresh collection on blogs, extract newly published blog post information, generate the corresponding blog post records, and store them.

Specifically, the detailed process of collecting news data includes:

wherein the data transfer tool is Sqoop.

Although specific embodiments of the invention have been described above with reference to the accompanying drawings, they do not limit the scope of protection of the invention. Those of ordinary skill in the art should understand that various modifications or variations that can be made on the basis of the technical solution of the invention without creative effort still fall within the scope of protection of the invention.

Claims

1. A network-oriented multichannel big data acquisition system, characterized in that the system is composed of a distributed oriented acquisition architecture formed by a forum data acquisition unit, a blog data acquisition unit, a news data acquisition unit, and a relational-database data acquisition unit;

the forum data acquisition unit is configured to acquire network data in online forums and offline forums using a dynamic web page acquisition method and a web page information extraction method, respectively;

the blog data acquisition unit is configured to traverse blog sites breadth-first in order to obtain blog Feed addresses, to collect the blog corresponding to each Feed address in real time, to track updated blog articles, and to gather blog information incrementally;

the news data acquisition unit is configured to extract the body text of news web pages using a method based on the row block distribution function, thereby obtaining news data;

the relational-database data acquisition unit is configured to use a data transfer tool to collect data from relational databases in batches.

2. The network-oriented multichannel big data acquisition system as claimed in claim 1, characterized in that, in the forum data acquisition unit, network data in a forum is collected with the column as the basic unit; through four stages, namely column page acquisition, column page information extraction, post page acquisition, and post page information extraction, web page acquisition and web page information extraction are combined to obtain the network data in the forum.

3. The network-oriented multichannel big data acquisition system as claimed in claim 1, characterized in that the blog data acquisition unit consists of one Feed finder and multiple information collectors; the Feed finder is configured to obtain the URL addresses or RSS addresses of blogs and to follow the link relationships of each blog page to obtain the URL addresses or RSS addresses of other blogs; the collectors are configured to perform incremental refresh collection on blogs, extract newly published blog post information, generate the corresponding blog post records, and store them.

4. The network-oriented multichannel big data acquisition system as claimed in claim 1, characterized in that the news data acquisition unit includes a web page HTML source preprocessing module, which handles the encoding of the web page HTML source and removes scripts and special characters; and

a body text extraction module, which uses a preset distribution function of per-row character counts to extract the desired body text from the rough page text, thereby obtaining news data.

5. The network-oriented multichannel big data acquisition system as claimed in claim 1, characterized in that, in the relational-database data acquisition unit, the data transfer tool is Sqoop.

6. A network-oriented multichannel big data acquisition method, characterized in that a distributed oriented acquisition architecture is used to collect forum data, blog data, news data, and data in relational databases simultaneously in a distributed manner;

wherein network data in online forums and offline forums is acquired using a dynamic web page acquisition method and a web page information extraction method, respectively;

in collecting blog data, blog Feed addresses are first obtained; the blog corresponding to each Feed address is then collected in real time, updated blog articles are tracked, and blog information is gathered incrementally;

a data transfer tool is used to collect data from relational databases in batches.

7. The network-oriented multichannel big data acquisition method as claimed in claim 6, characterized in that, in collecting forum data, network data in a forum is collected with the column as the basic unit; through four stages, namely column page acquisition, column page information extraction, post page acquisition, and post page information extraction, web page acquisition and web page information extraction are combined to obtain the network data in the forum.

8. The network-oriented multichannel big data acquisition method as claimed in claim 6, characterized in that, in collecting forum data, a Feed finder is used to obtain the URL addresses or RSS addresses of blogs and to follow the link relationships of each blog page to obtain the URL addresses or RSS addresses of other blogs; collectors are used to perform incremental refresh collection on blogs, extract newly published blog post information, generate the corresponding blog post records, and store them.

9. The network-oriented multichannel big data acquisition method as claimed in claim 6, characterized in that the detailed process of collecting news data includes:

extracting the desired body text from the rough page text using a preset distribution function of per-row character counts, thereby obtaining news data.

10. The network-oriented multichannel big data acquisition method as claimed in claim 6, characterized in that the data transfer tool is Sqoop.