CN106844782A - A network-oriented multi-channel big data acquisition system and method - Google Patents

A network-oriented multi-channel big data acquisition system and method

Info

Publication number
CN106844782A
Authority
CN
China
Prior art keywords
data
blog
network
forum
oriented
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710142262.5A
Other languages
Chinese (zh)
Other versions
CN106844782B (en)
Inventor
朱世伟
杨子江
于俊凤
李源
冯海洲
魏墨济
王燕
李思思
张铭君
王彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Southern Power Grid Internet Service Co ltd
Jingchuang United Beijing Intellectual Property Service Co ltd
Original Assignee
INFORMATION RESEARCH INSTITUTE OF SHANDONG ACADEMY OF SCIENCES
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by INFORMATION RESEARCH INSTITUTE OF SHANDONG ACADEMY OF SCIENCES
Priority to CN201710142262.5A
Publication of CN106844782A
Application granted
Publication of CN106844782B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/972 Access to data in other repository systems, e.g. legacy data or dynamic Web page generation

Abstract

The invention discloses a network-oriented multi-channel big data acquisition system and method. The system consists of a distributed targeted-acquisition architecture formed by a forum data acquisition unit, a blog data acquisition unit, a news data acquisition unit and a relational database data acquisition unit. The forum data acquisition unit acquires network data in online forums and offline forums by a dynamic web page acquisition method and a web page information extraction method, respectively. The blog data acquisition unit traverses blog sites breadth-first in order to obtain blog Feed addresses, collects the blog corresponding to each Feed address in real time, tracks updated blog posts, and collects blog information incrementally. The news data acquisition unit extracts the body text of news web pages using a method based on the line-block distribution function. The relational database data acquisition unit uses a data transfer tool to batch-collect data from relational databases.

Description

A network-oriented multi-channel big data acquisition system and method
Technical field
The invention belongs to the field of network data processing, and in particular relates to a network-oriented multi-channel big data acquisition system and method.
Background art
"Big data" has become a strategic resource as important as natural resources and human resources, and the enormous social and economic value it holds has attracted great attention from the scientific and technological community and from industry. Organizing and using big data effectively would give a strong impetus to social and economic development. Most of this rapidly growing data comes from people's daily lives; in particular, the Internet has become China's largest public information distribution center and social platform. Compared with traditional media such as newspapers, radio and television, online media has a low entry threshold, an extremely large volume of information, rapid publication and dissemination, a huge number of participants and strong real-time interactivity, and it has become the fastest and most extensive information channel in social, political and economic fields. How to find useful information in time within the large amount of Internet data has become a focus of attention for government and industry.
Network data resources are large in scale, come from different websites around the world and are widely dispersed. Faced with massive network information in diverse forms, the difficulty of big data acquisition lies in accurately identifying and extracting information from different sources and in different formats, collecting it efficiently and comprehensively, and tracking its updates in time. This is also the basis for the accuracy of later big data analysis.
Summary of the invention
To remedy the deficiencies of the prior art, a first object of the invention is to provide a network-oriented multi-channel big data acquisition system.
The network-oriented multi-channel big data acquisition system of the invention consists of a distributed targeted-acquisition architecture formed by a forum data acquisition unit, a blog data acquisition unit, a news data acquisition unit and a relational database data acquisition unit;
The forum data acquisition unit is configured to acquire network data in online forums and offline forums by a dynamic web page acquisition method and a web page information extraction method, respectively;
The blog data acquisition unit is configured to traverse blog sites breadth-first in order to obtain blog Feed addresses, to collect the blog corresponding to each Feed address in real time, to track updated blog posts, and to collect blog information incrementally;
The news data acquisition unit is configured to extract the body text of news web pages using a method based on the line-block distribution function, and thereby obtain news data;
The relational database data acquisition unit is configured to batch-collect data from relational databases using a data transfer tool.
Further, in the forum data acquisition unit, network data in a forum is collected with the board (column) as the basic unit. Through four stages, namely board page acquisition, board page information extraction, post page acquisition and post page information extraction, web page acquisition and web page information extraction are combined to obtain the network data in the forum.
By combining efficient dynamic web page acquisition with web page information extraction, the invention collects the posts in specified boards of specified forum websites, together with their meta-information, in a real-time, comprehensive and accurate manner.
Further, the blog data acquisition unit consists of one Feed discoverer and multiple information collectors. The Feed discoverer obtains the URL addresses or RSS addresses of blogs and follows the links on each blog page to obtain the URL addresses or RSS addresses of other blogs. Each collector performs incremental refresh collection on blogs, extracts newly published post information, generates the corresponding post records and stores them.
The invention can collect updated blog data in real time, so that data acquisition is timely and accurate.
Further, the news data acquisition unit includes a web page HTML source preprocessing module, which handles character encoding and removes scripts and special characters from the HTML source; and
a format tag removal module, which removes format tags from the preprocessed HTML source to obtain rough page text; and
a body text extraction module, which uses a preset distribution function of per-line character counts to extract the body text from the rough page text, and thereby obtains news data.
The invention can obtain news data intuitively, efficiently and accurately.
Further, in the relational database data acquisition unit, the data transfer tool is Sqoop.
Sqoop is a tool for transferring data between Hadoop and relational databases. It can import data from a relational database (such as MySQL, Oracle, Postgres, or any relational database that supports the JDBC specification) into Hadoop's HDFS. It also provides connectors for some NoSQL databases. Like other ETL tools, Sqoop uses a metadata model to determine data types and ensures type-safe data handling when data is transferred from the data source to Hadoop. Sqoop is designed for bulk transfer of big data: it can partition a data set and create Hadoop tasks to process each partition.
A second object of the invention is to provide a network-oriented multi-channel big data acquisition method.
The network-oriented multi-channel big data acquisition method of the invention uses a distributed targeted-acquisition architecture to collect forum data, blog data, news data and relational database data in a distributed manner at the same time;
Wherein, network data in online forums and offline forums is acquired by a dynamic web page acquisition method and a web page information extraction method, respectively;
In collecting blog data, blog Feed addresses are obtained first; the blog corresponding to each Feed address is then collected in real time, updated blog posts are tracked, and blog information is collected incrementally;
The body text of news web pages is extracted using a method based on the line-block distribution function, and news data is thereby obtained;
A data transfer tool is used to batch-collect data from relational databases.
Further, in collecting forum data, network data in a forum is collected with the board (column) as the basic unit. Through four stages, namely board page acquisition, board page information extraction, post page acquisition and post page information extraction, web page acquisition and web page information extraction are combined to obtain the network data in the forum.
Further, in collecting blog data, a Feed discoverer obtains the URL addresses or RSS addresses of blogs and follows the links on each blog page to obtain the URL addresses or RSS addresses of other blogs; a collector performs incremental refresh collection on blogs, extracts newly published post information, generates the corresponding post records and stores them.
Further, the detailed process of collecting news data includes:
handling the character encoding of the web page HTML source and removing scripts and special characters;
removing format tags from the preprocessed HTML source to obtain rough page text;
using a preset distribution function of per-line character counts to extract the body text from the rough page text, and thereby obtaining news data.
Further, the data transfer tool is Sqoop.
Compared with the prior art, the beneficial effects of the invention are as follows:
(1) Faced with massive network information in diverse forms, the invention uses a distributed targeted-acquisition architecture to collect forum data, blog data, news data and relational database data in a distributed manner at the same time. It accurately identifies and extracts information from different sources and in different formats, collects information efficiently and comprehensively, tracks information updates in time, and reduces the workload of manually maintaining big data.
(2) The invention ensures, to the greatest extent, the efficiency, comprehensiveness and timeliness of network information acquisition, and provides a comprehensive, stable and secure information source for upper-layer analysis and processing modules.
Brief description of the drawings
The accompanying drawings, which form a part of this application, are provided for further understanding of the application. The illustrative embodiments of the application and their descriptions are used to explain the application and do not unduly limit it.
Fig. 1 is a structural diagram of the network-oriented multi-channel big data acquisition system of the invention.
Fig. 2 is a diagram of the forum information acquisition process.
Fig. 3 is a flow chart of obtaining the board page data of a forum.
Fig. 4 is a system architecture diagram of the blog data acquisition unit.
Fig. 5 is a functional diagram of the blog data acquisition unit.
Fig. 6 shows the body text extraction framework based on the line-block distribution function method.
Fig. 7 is an architecture diagram of HDFS.
Detailed description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which this application belongs.
It should also be noted that the terminology used here is only for describing specific embodiments and is not intended to limit the exemplary embodiments of the application. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well. It should further be understood that the terms "comprising" and/or "including", when used in this specification, indicate the presence of features, steps, operations, devices, components and/or combinations thereof.
Fig. 1 is a structural diagram of the network-oriented multi-channel big data acquisition system of the invention.
As shown in Fig. 1, the network-oriented multi-channel big data acquisition system of the invention consists of a distributed targeted-acquisition architecture formed by a forum data acquisition unit, a blog data acquisition unit, a news data acquisition unit and a relational database data acquisition unit;
The forum data acquisition unit is configured to acquire network data in online forums and offline forums by a dynamic web page acquisition method and a web page information extraction method, respectively;
The blog data acquisition unit is configured to traverse blog sites breadth-first in order to obtain blog Feed addresses, to collect the blog corresponding to each Feed address in real time, to track updated blog posts, and to collect blog information incrementally;
The news data acquisition unit is configured to extract the body text of news web pages using a method based on the line-block distribution function, and thereby obtain news data;
The relational database data acquisition unit is configured to batch-collect data from relational databases using a data transfer tool.
Given the characteristics of different types of networks, the invention adopts targeted acquisition: a target site in a given network is the basic task unit of information acquisition, and each acquisition task can use its own acquisition rules and strategies (such as crawl depth, update frequency and information extraction template). To meet the scale and flexibility requirements of network data acquisition, a distributed targeted-acquisition architecture of "master-slave distribution, autonomous cooperation" is used.
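The per-task rules and the master-slave distribution described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the class name AcquisitionTask, its field names, the round-robin assignment and the example.com addresses are hypothetical and are not taken from the disclosed system.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class AcquisitionTask:
    """One target site with its own acquisition rules (all field names are illustrative)."""
    site: str                  # entry URL of the target site
    channel: str               # "forum", "blog", "news" or "rdb"
    max_depth: int = 2         # crawl depth
    refresh_minutes: int = 60  # update frequency
    template: str = "default"  # name of the information extraction template


def assign_tasks(tasks, workers):
    """Master side: hand tasks to workers round-robin; each worker then runs its own tasks."""
    plan = {worker: [] for worker in workers}
    for task, worker in zip(tasks, cycle(workers)):
        plan[worker].append(task)
    return plan


tasks = [
    AcquisitionTask("http://forum.example.com/board/1", "forum", max_depth=3, refresh_minutes=10),
    AcquisitionTask("http://blog.example.com/", "blog", refresh_minutes=30),
    AcquisitionTask("http://news.example.com/", "news", template="news-body"),
]
plan = assign_tasks(tasks, ["worker-a", "worker-b"])
```

Keeping the rules on the task object rather than on the worker is one way to let every site carry its own depth, refresh interval and template while the workers stay interchangeable.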
Faced with massive network information in diverse forms, the system must accurately identify and extract information from different sources and in different formats, collect information efficiently and comprehensively, track information updates in time, and keep the maintenance workload as low as possible. To this end, the invention uses semi-automatic generation of vertical search templates, optimized dynamic page access and an intelligent crawl scheduling strategy. These ensure, to the greatest extent, the efficiency, comprehensiveness and timeliness of network information acquisition, and provide a comprehensive, stable and secure information source for upper-layer analysis and processing modules.
By combining efficient dynamic web page acquisition with web page information extraction, the invention collects the posts in specified boards of specified forum websites, together with their meta-information, in a real-time, comprehensive and accurate manner.
Forum information is acquired with the board (column) as the basic unit of the information source. Given a board, acquiring its information mainly comprises four stages (which may run in parallel in practice): board page acquisition → board page extraction → post page acquisition → post page extraction, as shown in Fig. 2.
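The four stages for one board (column) can be sketched as a small pipeline. This is a minimal illustration, not the disclosed implementation: the function names, the in-memory PAGES dictionary that stands in for the network, and the regular expressions are all assumptions chosen so that the sketch runs offline.

```python
import re


def acquire_board(board_url, fetch, extract_post_links, extract_post):
    """Run the four stages for one board; the fetcher and both extractors are supplied by the caller."""
    board_html = fetch(board_url)                # stage 1: board page acquisition
    post_urls = extract_post_links(board_html)   # stage 2: board page extraction
    for url in post_urls:
        post_html = fetch(url)                   # stage 3: post page acquisition
        yield extract_post(url, post_html)       # stage 4: post page extraction


PAGES = {  # hypothetical pages, used instead of real HTTP requests
    "/board": '<a href="/post/1">A</a> <a href="/post/2">B</a>',
    "/post/1": "<h1>First</h1>",
    "/post/2": "<h1>Second</h1>",
}


def fetch(url):
    return PAGES[url]


def extract_post_links(html):
    return re.findall(r'href="(/post/\d+)"', html)


def extract_post(url, html):
    return {"url": url, "title": re.search(r"<h1>(.*?)</h1>", html).group(1)}


posts = list(acquire_board("/board", fetch, extract_post_links, extract_post))
```

Because acquire_board is a generator, post pages are fetched one at a time as results are consumed, which leaves room for running the stages in parallel as the description allows.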
Collection starts from the board entry page, so the boards to be collected can be located directly, which matches the requirement of targeted data acquisition exactly. Through the four stages of board page acquisition, board page information extraction, post page acquisition and post page information extraction, web page acquisition and web page information extraction are tightly combined, which effectively resolves many problems of conventional information acquisition techniques.
The board page of a forum contains an index list of posts, and the list carries rich post meta-information. Each row of the list records one group of meta-information about a post, including the subject, the author, the posting time, the number of views and the number of replies. This meta-information is very important for forum data analysis. Because board pages are usually organized quite regularly, post meta-information can be extracted effectively from them. The method has two parts: 1. extracting metadata from the board page; it is called metadata rather than meta-information at this point because the meaning of each item (such as title or author) is not yet known; 2. integrating the metadata into storage: the meaning of each metadata item is identified (referred to as metadata parsing) so that the metadata becomes real meta-information, which is then stored. The flow of the whole method is shown in Fig. 3.
For metadata extraction, the offline operation is as follows: the user provides a board page as a sample page, and an unsupervised learning method generates a template for board pages similar to the training sample. The online operation is as follows: metadata is extracted from new board pages according to the template. Extraction operates on the DOM. The extraction process makes full use of the correspondence between the post records on the board page (and the attributes within each record) and the nodes in the DOM tree, as well as the structural characteristics of these nodes. This extraction method offers high extraction efficiency, accurate positioning and low maintenance cost.
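The online step, applying a template to the rows of a board (column) page, might look like the following sketch. It makes several assumptions that are not in the source: the post list is an HTML table, the template is a plain mapping from cell index to field name (here written by hand, whereas the description learns it offline from a sample page), and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def extract_meta(board_html, template):
    """Apply a template (cell index -> field name) to every post row of a board page."""
    parser = RowCollector()
    parser.feed(board_html)
    return [{name: row[i] for i, name in template.items()} for row in parser.rows]


# Hypothetical hand-written template for illustration only.
TEMPLATE = {0: "subject", 1: "author", 2: "posted", 3: "views", 4: "replies"}
board_html = """<table>
<tr><td><a href="/post/1">Hello</a></td><td>alice</td><td>2017-03-01</td><td>120</td><td>4</td></tr>
<tr><td><a href="/post/2">Question</a></td><td>bob</td><td>2017-03-02</td><td>35</td><td>0</td></tr>
</table>"""
records = extract_meta(board_html, TEMPLATE)
```

The split mirrors the two parts named in the text: RowCollector yields untyped metadata (cells without meaning), and the template turns each cell into named meta-information ready for storage.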
In a specific implementation, the basic task units also include the blog data acquisition unit, which traverses blog sites breadth-first in order to obtain blog Feed addresses, collects the blog corresponding to each Feed address in real time, tracks updated blog posts, and collects blog information incrementally.
The system architecture shown in Fig. 4 is used. The system is distributed, with one Feed discoverer and multiple information collectors. The goal of the Feed discovery module is to find as many RSS or Atom addresses as possible for the blogs hosted by each BSP (blog service provider). Analysis shows that the URL addresses or RSS addresses of the blogs of each BSP follow a certain pattern. This pattern can be used to recognize whether a page is a blog page, and more blogs are then discovered through the links on each blog page.
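The Feed discovery strategy, recognizing blog pages by the BSP's URL pattern and expanding through their links breadth-first, can be sketched as follows. The URL pattern, the assumed feed location (rss.xml under the blog home page), the LINKS graph and the function name are all hypothetical; a real BSP would need its own pattern.

```python
import re
from collections import deque

# Hypothetical BSP convention: a blog home page is http://blog.example.com/<user>/
BLOG_PATTERN = re.compile(r"^http://blog\.example\.com/([a-z0-9_]+)/$")


def discover_feeds(seed, get_links, limit=1000):
    """Breadth-first traversal from the seed; returns feed addresses of recognised blog pages."""
    seen, queue, feeds = {seed}, deque([seed]), []
    while queue and len(feeds) < limit:
        url = queue.popleft()
        if not BLOG_PATTERN.match(url):
            continue                     # not a blog page: do not expand it
        feeds.append(url + "rss.xml")    # assumed feed location for this BSP
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return feeds


LINKS = {  # in-memory link graph standing in for real pages
    "http://blog.example.com/alice/": ["http://blog.example.com/bob/", "http://ads.example.com/x"],
    "http://blog.example.com/bob/": ["http://blog.example.com/alice/", "http://blog.example.com/carol/"],
    "http://blog.example.com/carol/": [],
}
feeds = discover_feeds("http://blog.example.com/alice/", lambda url: LINKS.get(url, []))
```

The seen set prevents blogs that link to each other from being visited twice, and pages that do not match the pattern are dropped without being expanded, which keeps the traversal inside the blog site.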
The collector performs incremental refresh collection on blogs, extracts newly published post information, generates the corresponding post records and stores them. Its functions are shown in Fig. 5. The invention can collect updated blog data in real time, so that data acquisition is timely and accurate.
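Incremental collection can be illustrated with a short sketch that re-reads a feed and keeps only the items it has not stored before. The assumptions are that the feed is plain RSS 2.0 without XML namespaces, that the item link identifies a post, and that an in-memory set stands in for the store of already collected posts; the function name and record fields are hypothetical.

```python
import xml.etree.ElementTree as ET


def collect_new_posts(rss_xml, seen_links):
    """Return records for items not collected before and remember them in seen_links."""
    new_records = []
    for item in ET.fromstring(rss_xml).iter("item"):
        link = item.findtext("link", default="").strip()
        if not link or link in seen_links:
            continue                      # already stored, or unusable without a link
        seen_links.add(link)
        new_records.append({
            "link": link,
            "title": item.findtext("title", default="").strip(),
            "published": item.findtext("pubDate", default="").strip(),
        })
    return new_records


FEED_V1 = """<rss version="2.0"><channel><title>demo</title>
<item><title>Post 1</title><link>http://blog.example.com/alice/1</link><pubDate>Wed, 01 Mar 2017 08:00:00 GMT</pubDate></item>
</channel></rss>"""
# A later snapshot of the same feed with one additional post.
FEED_V2 = FEED_V1.replace(
    "</channel>",
    "<item><title>Post 2</title><link>http://blog.example.com/alice/2</link></item></channel>",
)

seen = set()
first = collect_new_posts(FEED_V1, seen)
second = collect_new_posts(FEED_V2, seen)
```

On the second refresh only the newly published post is returned, so each refresh costs one feed download plus work proportional to the number of new posts.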
In obtaining news data, the body text of news web pages is extracted using the method based on the line-block distribution function, and news data is thereby obtained.
The main task of body data extraction is to identify the information that interests the user within the unstructured or semi-structured information on the Web and convert it into data that is well structured and semantically clear. The input of an information extraction system is raw text and the output is information in a fixed format. Finally, the extracted data is cleaned and organized and then stored in a relational database for further precise querying and pattern extraction.
To capture the Chinese text in news web pages effectively, the body text is extracted using the method based on the line-block distribution function, yielding the core content of the document. The body text extraction framework based on the line-block distribution function method is shown in Fig. 6.
In HTML, body text is always mixed with tags. Admittedly, the way tags decorate text plays an important role in determining term weights and ranking results. However, precisely because HTML tags and text are interleaved in complex and non-standard ways, general-purpose body text extraction is hard to achieve; in the end, different rules have to be defined for different websites, and time and space complexity suffers greatly.
In view of this, the invention proposes a general method based on the line-block distribution function that extracts the body text in linear time O(N). The method rests on two core observations: 1. the density of the body text region; 2. the length of line blocks.
Observation 1: the body text region of a web page is certainly one of the regions where text is most densely distributed, and it may be the largest such region, but not necessarily. For example, when the comments are long, or when the news body is short and a large, dense block of navigation text is present, the body text region is not the largest block.
Observation 2: line-block length information effectively resolves this problem.
Combining observations 1 and 2 gives good body text extraction. Both are merged into the line-block distribution function, as follows:
First, the tags are removed from the page HTML so that only the text remains, while the positions of all blank lines left after tag removal are retained; the remaining text is called Ctext.
Definition 1. Line block:
Taking a line number of Ctext as the axis, take the K lines around it (either direction of context may be used; K < 5, here K = 3, taken downward; K is called the line-block thickness); together they are called a line block Cblock. Line block i is the line block whose axis is line i of Ctext;
Definition 2. Line-block length:
The length of a Cblock is the total number of characters remaining after all whitespace characters (\n, \r, \t, etc.) are removed from it;
Definition 3. Line-block distribution function:
Taking each line of Ctext as an axis, there are LinesNum(Ctext) - K Cblocks in total. The distribution function is drawn with [1, LinesNum(Ctext) - K] on the horizontal axis and the corresponding line-block length on the vertical axis;
The line-block distribution function can be computed in O(N) time, and the region containing the body text can be seen directly on its graph. The graph shows clearly that the correct body text region is a continuous region of the distribution function that contains the maximum value, and that this region usually has one sharp-rise point and one sharp-drop point.
The problem of body text extraction is thus converted into finding two boundary points on the line-block distribution function, a sharp rise and a sharp drop; the region bounded by these two points contains the maximum line-block length of the current page and is continuous.
To find the starting line block X_start and the ending line block X_end of the body text region (X is the line number and Y(X) is the length of the line block with axis X), the following four conditions must be satisfied:
(1) Y(X_start) > Y(X_t) (Y(X_t) is the first sharp-rise point; a sharp-rise point must exceed a certain threshold);
(2) Y(X_n) ≠ 0 (n ∈ [start+1, start+K], where K is the line-block thickness; the line-block lengths immediately following the sharp-rise point must not be 0, which avoids noise);
(3) Y(X_m) = 0 (m ∈ [end, end+1]; the line-block lengths immediately following the sharp-drop point are 0, which ensures that the body text has ended);
(4) there exists X such that, where max(Y(X)) is attained, X ∈ [X_start, X_end] (this ensures that the region is the one containing the maximum line block).
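The definitions and the four conditions above can be turned into a short sketch. It is one possible reading, not the patented implementation: it assumes that line block i covers lines i to i+K-1 (so that there are LinesNum - K blocks), that the sharp-rise threshold is a fixed number of characters (40 here, chosen arbitrarily), and that line numbers start at 0. The function names and the sample text are hypothetical.

```python
import re


def block_lengths(lines, k=3):
    """Y(i): characters left in lines i..i+k-1 after removing all whitespace."""
    sizes = [len(re.sub(r"\s+", "", line)) for line in lines]
    return [sum(sizes[i:i + k]) for i in range(max(len(lines) - k, 0))]


def extract_body(ctext, k=3, threshold=40):
    """Return the lines between the sharp rise and the sharp drop that enclose the peak."""
    lines = ctext.split("\n")
    y = block_lengths(lines, k)
    regions, start = [], None
    for i, length in enumerate(y):
        if start is None:
            # conditions (1) and (2): above the threshold, and the next k blocks are non-empty
            following = y[i + 1:i + 1 + k]
            if length > threshold and following and all(following):
                start = i
        elif length == 0 and (i + 1 == len(y) or y[i + 1] == 0):
            # condition (3): the block lengths at end and end+1 are 0
            regions.append((start, i))
            start = None
    if start is not None:                 # the text runs to the end of the page
        regions.append((start, len(lines) - 1))
    if not regions:
        return ""
    # condition (4): keep the candidate region that contains the largest line block
    start, end = max(regions, key=lambda r: max(y[r[0]:r[1] + 1]))
    return "\n".join(line.strip() for line in lines[start:end + 1] if line.strip())


ctext = "\n".join([
    "Home", "News", "", "", "",
    "The city council approved the new budget on Monday after a long debate.",
    "Officials said the plan increases funding for schools and public transport.",
    "Residents will see the first changes early next year, the mayor added.",
    "", "", "", "",
    "Contact", "About",
])
body = extract_body(ctext)
```

In this sample the short navigation and footer lines never push a line block above the threshold, so only the three long sentences are returned. The scan visits each line block once, which matches the linear-time claim for a fixed K.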
The invention can obtain news data intuitively, efficiently and accurately.
The news data acquisition unit of the invention includes a web page HTML source preprocessing module, which handles character encoding and removes scripts and special characters from the HTML source; and
a format tag removal module, which removes format tags from the preprocessed HTML source to obtain rough page text; and
a body text extraction module, which uses a preset distribution function of per-line character counts to extract the body text from the rough page text, and thereby obtains news data.
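The preprocessing and tag removal modules can be sketched with regular expressions, as a rough stand-in for whatever parser an implementation would actually use. The assumptions are that the page has already been decoded to a Python string, that every removed span is replaced by as many line breaks as it contained so that line positions are preserved, and that the function name html_to_ctext is hypothetical.

```python
import html
import re


def html_to_ctext(source):
    """Strip scripts, styles, comments and tags while keeping the original line breaks."""

    def blank(match):
        # keep one line break per removed line so that later line numbers stay aligned
        return "\n" * match.group(0).count("\n")

    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", blank, source)  # scripts and styles
    text = re.sub(r"(?s)<!--.*?-->", blank, text)                         # comments
    text = re.sub(r"(?s)<[^>]*>", blank, text)                            # remaining tags
    text = html.unescape(text)                                            # entities such as &amp;
    return "\n".join(line.strip() for line in text.split("\n"))


page = """<html><head><title>Demo</title>
<script>
var x = 1;
</script></head>
<body><p>Tom &amp; Jerry</p>
<div>second line</div></body></html>"""
rough_text = html_to_ctext(page)
```

The result has the same number of lines as the input, with blank lines where the script used to be, which is the property the line-block method relies on when it measures text density per line.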
The invention can obtain news data intuitively, efficiently and accurately.
In a specific implementation, in the relational database data acquisition unit, the data transfer tool is Sqoop.
Sqoop is a tool for transferring data between Hadoop and relational databases. It can import data from a relational database (such as MySQL, Oracle, Postgres, or any relational database that supports the JDBC specification) into Hadoop's HDFS. It also provides connectors for some NoSQL databases. Like other ETL tools, Sqoop uses a metadata model to determine data types and ensures type-safe data handling when data is transferred from the data source to Hadoop. Sqoop is designed for bulk transfer of big data: it can partition a data set and create Hadoop tasks to process each partition.
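A batch import of one table with Sqoop could be driven from a small helper such as the one below. The helper only builds the command line and does not run it; the JDBC URL, user name, file paths, table and column names are hypothetical, and the choice of four map tasks is an arbitrary example. The options shown (--connect, --username, --password-file, --table, --target-dir, --split-by, --num-mappers) are standard options of the sqoop import command.

```python
import shlex


def sqoop_import_args(jdbc_url, username, password_file, table, target_dir, split_by, mappers=4):
    """Build the argument list for one `sqoop import` run (nothing is executed here)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,  # password kept in a file, not on the command line
        "--table", table,
        "--target-dir", target_dir,         # HDFS directory that receives the rows
        "--split-by", split_by,             # column used to partition the table
        "--num-mappers", str(mappers),      # number of parallel map tasks
    ]


args = sqoop_import_args(
    "jdbc:mysql://db.example.com/forum", "collector", "/user/collector/.pw",
    "posts", "/data/raw/posts", "id",
)
command_line = shlex.join(args)
```

On a machine where Sqoop and Hadoop are installed, the list could be passed to subprocess.run(args, check=True). The --split-by column and the number of mappers correspond to the partitioning described above: Sqoop divides the table on that column and starts one map task per partition.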
The Hadoop framework consists of the distributed file system HDFS and MapReduce. HDFS is Hadoop's file system and is used to store very large files; MapReduce is Hadoop's parallel programming model and is used for in-depth analysis of the data stored on HDFS.
Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS was originally developed as infrastructure for the Apache Nutch search engine project.
HDFS mainly consists of the Client, the Datanodes and the Namenode; its architecture is shown in Fig. 7. In a cluster built on Hadoop, one or two hosts usually serve as the Namenode and several hosts serve as Datanodes. The Client represents a client program that uses HDFS; the Namenode is a host in the Hadoop cluster responsible for tasks such as storing information about the data nodes, distributing computing tasks and the final reduction; the Datanodes are responsible for storing and processing data. To ensure data safety, HDFS adds a moderate amount of redundancy: multiple copies of the same data are stored on different Datanodes, usually three copies.
The present invention faces the network information and its diversified message form of magnanimity, present invention distribution oriented acquisition body System structure is to forum data, blog data, news data and data be distributed collection simultaneously in relevant database, reaches Recognize, extract the information of separate sources and form exactly, but efficiently, comprehensively gather information, additionally it is possible to tracking letter in time The renewal of breath, and reduce the workload of manual maintenance big data.
The present invention ensure that the high efficiency of grid information access process, comprehensive, promptness to greatest extent, for upper Layer analysis processing module provides information source comprehensively, stable, safe.
A kind of multichannel big data acquisition method of network-oriented of the invention, it uses distributed oriented acquisition system frame Structure is to forum data, blog data, news data and data be distributed collection simultaneously in relevant database;
Wherein, respectively by dynamic web page acquisition method and method for abstracting web page information in online forum and offline forum Network data be acquired;
During blog data, first, blog Feed addresses are obtained;Then, it is corresponding to each Feed address rich Visitor carries out Real-time Collection, tracks the blog articles for updating, and blog information is gathered in incremental update mode;
Body text in news web page is extracted using the method based on row block distribution function, and then obtains news data;
Using data transfer tool come data in batch capture relevant database.
Specifically, when collecting forum data, the network data in a forum is collected with the board as the basic unit. Collection proceeds through four stages: board page acquisition, board page information extraction, post page acquisition, and post page information extraction. Page acquisition and page information extraction are thus combined to obtain the network data in the forum.
By combining efficient dynamic web page collection technology with web page information extraction technology, the present invention collects, in real time, comprehensively, and accurately, the posts and their associated metadata in the specified boards of a specified forum website.
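The four stages of forum collection can be sketched as a small pipeline. Everything concrete here is an assumption made for illustration: the in-memory `PAGES` dictionary replaces real HTTP requests, and the markup conventions (`class="post"` links, an `h1` title, a `div` with `id="body"`) are hypothetical and would differ for every forum.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "forum" standing in for real page fetches.
PAGES = {
    "/board/1": '<a class="post" href="/post/10">Hello</a>'
                '<a class="post" href="/post/11">World</a>'
                '<a href="/about">About</a>',
    "/post/10": "<h1>Hello</h1><div id='body'>First post</div>",
    "/post/11": "<h1>World</h1><div id='body'>Second post</div>",
}

def fetch(url):
    """Stages 1 and 3: page acquisition (assumed to be an HTTP GET in practice)."""
    return PAGES[url]

class PostLinkParser(HTMLParser):
    """Stage 2: extract post links from a board page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("class") == "post" and "href" in attributes:
            self.links.append(attributes["href"])

class PostParser(HTMLParser):
    """Stage 4: extract the title and body from a post page."""
    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "body": ""}
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._current = "title"
        elif tag == "div" and dict(attrs).get("id") == "body":
            self._current = "body"

    def handle_endtag(self, tag):
        self._current = None

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] += data

def collect_board(board_url):
    """Collect all posts of one board, the basic unit of forum collection."""
    link_parser = PostLinkParser()
    link_parser.feed(fetch(board_url))
    posts = []
    for url in link_parser.links:
        post_parser = PostParser()
        post_parser.feed(fetch(url))
        posts.append({"url": url, **post_parser.fields})
    return posts

posts = collect_board("/board/1")
```

A dynamic forum that renders content with scripts would need a different `fetch`, but the alternation of acquisition and extraction stays the same.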
Specifically, when collecting blog data, a Feed discoverer obtains the URL or RSS address of a blog and follows the link relationships of each blog page to obtain the URL or RSS addresses of other blogs. A collector performs incremental refresh collection on each blog, extracts the information of newly published blog posts, and generates and stores the corresponding blog post records.
The present invention can collect updated blog data in real time, so that data collection is both timely and accurate.
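A minimal sketch of the Feed discoverer and the incremental collector follows. It assumes that blogs advertise their feeds with a `link` element of type `application/rss+xml`, that feeds are RSS 2.0, and that an item's `link` identifies a post; the function names and the `fetch` callback are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import deque
from html.parser import HTMLParser

class FeedLinkParser(HTMLParser):
    """Collects advertised RSS feed URLs and outgoing links from one blog page."""
    def __init__(self):
        super().__init__()
        self.feeds = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "href" not in attributes:
            return
        if tag == "link" and attributes.get("type") == "application/rss+xml":
            self.feeds.append(attributes["href"])
        elif tag == "a":
            self.links.append(attributes["href"])

def discover_feeds(start_url, fetch, max_pages=100):
    """Breadth-first traversal of blog pages; returns feed URLs in discovery order."""
    queue, visited, feeds = deque([start_url]), {start_url}, []
    pages = 0
    while queue and pages < max_pages:
        pages += 1
        parser = FeedLinkParser()
        parser.feed(fetch(queue.popleft()))
        for feed_url in parser.feeds:
            if feed_url not in feeds:
                feeds.append(feed_url)
        for link in parser.links:
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return feeds

def new_entries(rss_xml, seen):
    """Incremental refresh: return only items not seen before and remember them."""
    fresh = []
    for item in ET.fromstring(rss_xml).iter("item"):
        link = item.findtext("link")
        if link and link not in seen:
            seen.add(link)
            fresh.append({"title": item.findtext("title"), "link": link})
    return fresh
```

Calling `new_entries` again with the same `seen` set and an unchanged feed returns an empty list, which is what makes the refresh incremental. In a real deployment `seen` would be persisted alongside the stored blog post records.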
Specifically, the process of collecting news data includes:
Preprocessing the HTML source of the web page: handling the character encoding, removing scripts, and processing special characters;
Removing the formatting tags from the preprocessed HTML source to obtain the rough text of the web page;
Extracting the desired body text from the rough text using a preset distribution function over the character counts of the lines, thereby obtaining the news data.
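The three steps above can be sketched as follows. This is a simplified reading of the line-block idea, not the exact function used by the invention: the block size, the threshold, and the rule of growing outward from the densest block are all assumptions chosen for illustration.

```python
import html
import re

def preprocess(source):
    """Steps 1 and 2: drop scripts, styles and comments, strip tags, decode entities."""
    source = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", source)
    source = re.sub(r"(?s)<!--.*?-->", "", source)
    text = re.sub(r"(?s)<[^>]+>", "", source)
    return html.unescape(text)

def _chars(line):
    """Number of non-whitespace characters in a line."""
    return len(re.sub(r"\s+", "", line))

def extract_body(text, block_size=3, threshold=40):
    """Step 3: keep the run of lines whose line blocks are dense with text."""
    lines = [line.strip() for line in text.split("\n")]
    # block[i] is the character count of lines i .. i + block_size - 1.
    block = [sum(_chars(l) for l in lines[i:i + block_size])
             for i in range(len(lines))]
    if not block or max(block) < threshold:
        return ""
    start = end = block.index(max(block))
    while start > 0 and block[start - 1] >= threshold:
        start -= 1
    while end + 1 < len(block) and block[end + 1] >= threshold:
        end += 1
    return "\n".join(l for l in lines[start:end + 1] if l)

SAMPLE = """<html><head><title>T</title><script>var x = 1;</script></head>
<body>
<div><a href="/">Home</a></div>
<div><a href="/n">News</a></div>


<p>The first paragraph of the article is long enough to count as body text.</p>
<p>The second paragraph continues the story with more detail &amp; context.</p>


<div>Copyright</div>
</body></html>"""

body = extract_body(preprocess(SAMPLE))
```

The approach relies on body text forming long, contiguous lines while navigation and footers are short and scattered, so it needs no site-specific templates. The threshold would have to be tuned for real pages.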
The data transfer tool is Sqoop.
Sqoop is a tool for transferring data between Hadoop and relational databases. It can import the data of a relational database (for example MySQL, Oracle, Postgres, or any other relational database that supports the JDBC specification) into the HDFS of Hadoop. It also provides connectors for some NoSQL databases. Like other ETL tools, Sqoop uses the metadata schema to determine data types and ensures type-safe data handling when data is transferred from the data source to Hadoop. Sqoop is designed for bulk transfer of big data: it can partition a data set and create Hadoop tasks to process each partition.
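As an illustration of such a batch import, the sketch below assembles a `sqoop import` command line for one table. The JDBC URL, table name, user name, and paths are hypothetical placeholders, and the choice of four mappers is an assumption; only the command-line options themselves are standard Sqoop import arguments.

```python
import shlex

def sqoop_import_command(jdbc_url, table, target_dir, username,
                         password_file, num_mappers=4):
    """Build the argument list for importing one relational table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC connection string of the source database
        "--username", username,
        "--password-file", password_file,  # file holding the password, kept off the command line
        "--table", table,
        "--target-dir", target_dir,        # HDFS directory that receives the imported files
        "--num-mappers", str(num_mappers), # number of parallel map tasks, one per partition
    ]

# Hypothetical example values.
cmd = sqoop_import_command(
    jdbc_url="jdbc:mysql://db.example.com/forum",
    table="posts",
    target_dir="/data/raw/posts",
    username="collector",
    password_file="/user/collector/.db-password",
)
print(shlex.join(cmd))
# On a machine with Sqoop installed, the import could be started with
# subprocess.run(cmd, check=True).
```

Building the command as a list rather than a single string avoids quoting problems when a table name or path contains unusual characters.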
Although specific embodiments of the present invention have been described above with reference to the accompanying drawings, they do not limit the scope of protection of the present invention. Those of ordinary skill in the art should understand that modifications or variations made on the basis of the technical solution of the present invention without creative effort still fall within the scope of protection of the present invention.

Claims (10)

1. A network-oriented multi-channel big data acquisition system, characterized in that the system consists of a distributed, targeted collection architecture made up of a forum data collection unit, a blog data collection unit, a news data collection unit, and a relational database data collection unit;
The forum data collection unit is configured to collect network data in online and offline forums by means of a dynamic web page collection method and a web page information extraction method, respectively;
The blog data collection unit is configured to traverse blog sites breadth-first in order to obtain blog feed addresses, to collect the blog corresponding to each feed address in real time, to track updated blog articles, and to collect blog information incrementally;
The news data collection unit is configured to extract the body text of news web pages using a method based on a line-block distribution function, thereby obtaining news data;
The relational database data collection unit is configured to use a data transfer tool to collect data from relational databases in batches.
2. The network-oriented multi-channel big data acquisition system according to claim 1, characterized in that, in the forum data collection unit, the network data in a forum is collected with the board as the basic unit, through four stages: board page acquisition, board page information extraction, post page acquisition, and post page information extraction; page acquisition and page information extraction are combined to obtain the network data in the forum.
3. The network-oriented multi-channel big data acquisition system according to claim 1, characterized in that the blog data collection unit consists of one Feed discoverer and a plurality of information collectors; the Feed discoverer is configured to obtain the URL or RSS address of a blog and to follow the link relationships of each blog page to obtain the URL or RSS addresses of other blogs; the collectors are configured to perform incremental refresh collection on the blogs, to extract the information of newly published blog posts, and to generate and store the corresponding blog post records.
4. The network-oriented multi-channel big data acquisition system according to claim 1, characterized in that the news data collection unit includes an HTML source preprocessing module configured to handle the character encoding of the HTML source of a web page, remove scripts, and process special characters; and
A formatting tag removal module configured to remove the formatting tags from the preprocessed HTML source to obtain the rough text of the web page; and
A body text extraction module configured to extract the desired body text from the rough text using a preset distribution function over the character counts of the lines, thereby obtaining news data.
5. The network-oriented multi-channel big data acquisition system according to claim 1, characterized in that, in the relational database data collection unit, the data transfer tool is Sqoop.
6. A network-oriented multi-channel big data acquisition method, characterized in that a distributed, targeted collection architecture is used to collect forum data, blog data, news data, and data in relational databases simultaneously and in a distributed manner;
Wherein the network data in online and offline forums is collected by means of a dynamic web page collection method and a web page information extraction method, respectively;
When collecting blog data, the blog feed addresses are obtained first; the blog corresponding to each feed address is then collected in real time, updated blog articles are tracked, and blog information is collected incrementally;
The body text of news web pages is extracted using a method based on a line-block distribution function, thereby obtaining news data;
A data transfer tool is used to collect data from relational databases in batches.
7. The network-oriented multi-channel big data acquisition method according to claim 6, characterized in that, when collecting forum data, the network data in a forum is collected with the board as the basic unit, through four stages: board page acquisition, board page information extraction, post page acquisition, and post page information extraction; page acquisition and page information extraction are combined to obtain the network data in the forum.
8. The network-oriented multi-channel big data acquisition method according to claim 6, characterized in that, when collecting blog data, a Feed discoverer obtains the URL or RSS address of a blog and follows the link relationships of each blog page to obtain the URL or RSS addresses of other blogs; a collector performs incremental refresh collection on each blog, extracts the information of newly published blog posts, and generates and stores the corresponding blog post records.
9. The network-oriented multi-channel big data acquisition method according to claim 6, characterized in that the process of collecting news data includes:
Preprocessing the HTML source of the web page: handling the character encoding, removing scripts, and processing special characters;
Removing the formatting tags from the preprocessed HTML source to obtain the rough text of the web page;
Extracting the desired body text from the rough text using a preset distribution function over the character counts of the lines, thereby obtaining news data.
10. The network-oriented multi-channel big data acquisition method according to claim 6, characterized in that the data transfer tool is Sqoop.
CN201710142262.5A 2017-03-10 2017-03-10 Network-oriented multi-channel big data acquisition system and method Active CN106844782B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710142262.5A CN106844782B (en) 2017-03-10 2017-03-10 Network-oriented multi-channel big data acquisition system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710142262.5A CN106844782B (en) 2017-03-10 2017-03-10 Network-oriented multi-channel big data acquisition system and method

Publications (2)

Publication Number Publication Date
CN106844782A true CN106844782A (en) 2017-06-13
CN106844782B CN106844782B (en) 2020-03-20

Family

ID=59143833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710142262.5A Active CN106844782B (en) 2017-03-10 2017-03-10 Network-oriented multi-channel big data acquisition system and method

Country Status (1)

Country Link
CN (1) CN106844782B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109840298A (en) * 2018-12-29 2019-06-04 中国科学院计算技术研究所 The multi information source acquisition method and system of large scale network data
CN110472122A (en) * 2019-07-31 2019-11-19 重庆古扬科技有限公司 A kind of dynamic distributed academic resources acquisition method of multichannel
CN111581478A (en) * 2020-05-07 2020-08-25 成都信息工程大学 Cross-website general news acquisition method for specific subject
CN113626674A (en) * 2021-08-03 2021-11-09 杭州隆埠科技有限公司 News collecting system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104504081A (en) * 2014-12-25 2015-04-08 北京东方剪报国际信息咨询有限公司 Intelligent analysis system for all-media detection and monitoring big data behaviors
CN105786864A (en) * 2014-12-24 2016-07-20 国家电网公司 Offline analysis method for massive data
US20160239500A1 (en) * 2013-12-02 2016-08-18 Qbase, LLC System and methods for extracting facts from unstructured text

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160239500A1 (en) * 2013-12-02 2016-08-18 Qbase, LLC System and methods for extracting facts from unstructured text
CN105786864A (en) * 2014-12-24 2016-07-20 国家电网公司 Offline analysis method for massive data
CN104504081A (en) * 2014-12-25 2015-04-08 北京东方剪报国际信息咨询有限公司 Intelligent analysis system for all-media detection and monitoring big data behaviors

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李晨等: ""基于MapReduce的网络爬虫设计与实现"", 《山东科学》 *
贺涛: ""面向中文博客的信息采集与倾向性检索"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Also Published As

Publication number Publication date
CN106844782B (en) 2020-03-20

Similar Documents

Publication Publication Date Title
CN106934014A (en) A kind of network data excavation based on Hadoop and analysis platform and its method
CN103023714B (en) The liveness of topic Network Based and cluster topology analytical system and method
CN1845104B (en) System and method for intelligent retrieval and processing of information
CN104462213A (en) User behavior analysis method and system based on big data
CN106844782A (en) The multichannel big data acquisition system and method for a kind of network-oriented
CN103546326B (en) Website traffic statistic method
CN103390038A (en) HBase-based incremental index creation and retrieval method
CN105022827A (en) Field subject-oriented Web news dynamic aggregation method
CN103020159A (en) Method and device for news presentation facing events
CN104951529A (en) Interactive analyzing method for website logs
CN102004775A (en) Intelligent-search-based Fujian Fujitsu search engine technology
CN102567521B (en) Webpage data capturing and filtering method
CN102646095A (en) Object classifying method and system based on webpage classification information
Wang et al. A novel blockchain oracle implementation scheme based on application specific knowledge engines
CN104765823A (en) Method and device for collecting website data
CN109710826A (en) A kind of internet information artificial intelligence acquisition method and its system
CN101639840A (en) Method and device for identifying semantic structure of network information
CN102156749A (en) Anatomic search and judgment method, system and distributed server system for map sites
CN106021580A (en) Impala cluster log analysis method and system based on Hadoop
den Besten Using social media to sample ideas: lessons from a Slate‐Twitter contest
CN106055572B (en) Page conversion parameter processing method and device
CN110321456B (en) Massive uncertain XML approximate query method
Malik et al. Ontology and Web Usage Mining towards an Intelligent Web focusing web logs
CN103268312B (en) A kind of corpus collection system based on user feedback and method thereof
Ramulu et al. A study of semantic web mining: Integrating domain knowledge into web mining

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20221229

Address after: Room 02A-084, Building C (Second Floor), No. 28, Xinxi Road, Haidian District, Beijing 100085

Patentee after: Jingchuang United (Beijing) Intellectual Property Service Co.,Ltd.

Address before: 250014 No. 19, ASTRI Road, Lixia District, Shandong, Ji'nan

Patentee before: INFORMATION Research Institute OF SHANDONG ACADEMY OF SCIENCES

Effective date of registration: 20221229

Address after: Room 606-609, Compound Office Complex Building, No. 757, Dongfeng East Road, Yuexiu District, Guangzhou, Guangdong Province, 510699

Patentee after: China Southern Power Grid Internet Service Co.,Ltd.

Address before: Room 02A-084, Building C (Second Floor), No. 28, Xinxi Road, Haidian District, Beijing 100085

Patentee before: Jingchuang United (Beijing) Intellectual Property Service Co.,Ltd.