CN110175280A - A crawler analysis platform based on government-affairs big data - Google Patents
- Publication number
- CN110175280A (application CN201910364873.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- module
- user
- crawler
- website
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9532—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
Abstract
The invention discloses a crawler analysis platform based on government-affairs big data. The platform supports distributed cooperative crawling, visual crawling, and intelligent analysis of user behavior to quickly screen out the data a user needs; for distributed cooperative crawling, adding enough cluster nodes greatly improves the processing capacity of the system. Intelligent screening analysis supports user-defined data-screening rules and analyzes how users screen data, making the screening increasingly accurate. The platform provides efficient data-processing capability together with rich custom rules and intelligent behavior analysis, helping enterprises acquire the government data they need. At the same time it completes the interaction and feedback between users and government sources, forming a closed data loop and making it far easier to connect to government data.
Description
Technical field
The present invention relates to the technical field of big data platforms, and in particular to a crawler analysis platform based on government-affairs big data.
Background art
Although big data technology is currently very popular, China's big data industry is still in its infancy and its industrial chain is immature. After a big data industrial park is established, there may not be enough enterprises entering it to form a complete big data ecosystem.
Existing big data application technology also has safety and latent problems. First, there are the threats faced by big data itself: the often-mentioned security problem that, once big data technologies, systems, and applications have accumulated a large amount of value, they inevitably become targets of attack. Second, there are the problems and side effects brought by the excessive abuse of big data, the most typical being leakage of personal privacy, as well as leakage of business secrets and state secrets enabled by big data analysis capability. Third, there are safety problems of intelligence and awareness. Threats to big data, the side effects of big data, and the extremes of big data intelligence can all hinder and damage the development of big data.
Therefore, how to provide a big data platform that is safe and realizes efficient utilization of government data is a problem urgently to be solved by those skilled in the art.
Summary of the invention
The present invention provides a crawler analysis platform based on government-affairs big data. Building on advanced big data technology and addressing the shortcomings of the prior art, it expands the coverage of big data application technology so that more users can apply government-affairs big data technology, realizes efficient utilization of government data through big data application technology, and at the same time ensures the safety of government data, allowing government data to form a more complete big data ecosystem. The concrete scheme is as follows:
A crawler analysis platform based on government-affairs big data includes a data acquisition module, a data cleaning module, a data classification module, a data sampling module, a data statistics module, and a data pushing module;
wherein the data acquisition module is used for crawling website data;
the data acquisition module is connected with the data cleaning module and hands the crawled website data to the data cleaning module, which performs data cleaning according to recommended cleaning rules;
the data cleaning module is connected with the data classification module, and the data cleaned by the data cleaning module undergo multiple rounds of classification processing;
the data classification module is connected with the data sampling module, and the data classified by the data classification module are sampled according to expected data models;
the data sampling module is connected with the data statistics module, and the data sampled by the data sampling module are counted to obtain data ranked by the attention of different users;
the data statistics module is connected with the data pushing module, the data of the data statistics module are pushed to users according to user attention, and the users' expected data in the expected data models are pushed back to the website data sources, forming a closed data loop.
Preferably, the data acquisition module uses the Scrapy framework and performs visual crawling and/or distributed cooperative crawling of website data; the websites include government websites or other partner websites.
The visual crawling process includes:
S1. analyzing the website data online, generating a data topology graph, and automatically deriving collection rules from the user's operations;
S2. converting the structure of the collected data;
S3. receiving the user's preliminary filtering operations on the collection results through the visual interface, together with the crawled website data.
The distributed cooperative crawling process includes:
S1. counting the website data and distributing it to each crawler;
S2. crawling the website data with cooperating multi-node crawlers;
S3. summarizing the data of the multi-node crawlers.
Preferably, the cleaning rules of the data cleaning module include data screening and data replacement;
the data screening screens out incomplete data, wrong data, and repeated data;
the data replacement classifies the screened-out data by screening type, receives the user's processing-operation instructions for the screened-out data, and saves the user's processing-operation mode as a scheme to recommend to the user next time; the processing-operation instructions include selecting an alternative value for handling; the cleaned data are then handed to the data classification module.
Preferably, the classification process of the data classification module includes:
S1. first performing a preliminary classification according to the data sources and the association relationships between data sources;
S2. after classifying by data source, classifying according to the similarity of data formats;
S3. subdividing according to user-defined classification rules;
S4. comparing the data before and after subdivision and adjusting accordingly;
S5. increasing the weight of the operation type of a subdivision classification rule each time it is executed.
Preferably, the sampling process of the data sampling module includes:
S1. establishing expected data models;
S2. inputting the classified data into the corresponding expected data models for training, and rejecting unqualified samples;
S3. choosing whether to input the rejected data into other expected data models for training;
S4. judging, from the data whose training is completed, whether to add expected-data conditions, and returning to S2 to train again;
S5. obtaining the final sampled data, recording it into the storage unit, and sending it to the data statistics module.
Preferably, the statistics process of the data statistics module includes:
S1. obtaining and counting the users' expected data;
S2. generating reports from the generated data according to the user subscriptions to the data recorded in the storage unit;
S3. obtaining data ranked by the attention of different users and handing them to the data pushing module.
Preferably, the data pushing module delivers the data produced by the data statistics module to different users and to the source websites, forming a closed data loop.
Compared with the prior art, the present invention has the following advantages:
The present disclosure provides a crawler analysis platform based on government-affairs big data. Building on advanced big data technology and addressing the shortcomings of the prior art, it uses the mixed Hadoop + MPP + in-memory-database mode as its technical architecture while using Python to support the acquisition and computation of real-time data, realizing a highly concurrent, scalable, high-performance big data system. It expands the coverage of big data application technology so that more enterprises can use government-affairs big data technology; the resulting big data platform applied to government data can realize efficient utilization of government data through big data application technology while also keeping government data highly safe, allowing government data to form a more complete big data ecosystem.
Brief description of the drawings
In order to explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the invention; for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a structural framework diagram of a crawler analysis platform based on government-affairs big data according to the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below in combination with the drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
This embodiment provides a crawler analysis platform based on government-affairs big data. Referring to the structural framework diagram disclosed in Fig. 1, it includes a data acquisition module, a data cleaning module, a data classification module, a data sampling module, a data statistics module, and a data pushing module. The data acquisition module is used for crawling website data. The data acquisition module is connected with the data cleaning module and hands the crawled website data to the data cleaning module, which performs data cleaning according to recommended cleaning rules. The data cleaning module is connected with the data classification module, and the cleaned data undergo multiple rounds of classification processing. The data classification module is connected with the data sampling module, and the classified data are sampled according to expected data models. The data sampling module is connected with the data statistics module, and the sampled data are counted to obtain data ranked by the attention of different users. The data statistics module is connected with the data pushing module; the data of the data statistics module are pushed to users according to user attention, and the users' expected data in the expected data models are pushed back to the website data sources, forming a closed data loop.
Specifically, the data acquisition module uses the Python Scrapy framework. Scrapy is an application framework written for crawling website data and extracting structured data; it can be applied in a series of programs for data mining, information processing, or storing historical data. It was originally designed for page scraping (more precisely, web scraping), for obtaining data returned by APIs (such as Amazon Associates Web Services), or for general web crawlers. Scrapy is widely used and can serve data mining, monitoring, and automated testing. The data acquisition module combines distributed cooperative crawling over the website data with user-defined, in-depth visual page parsing for fast data extraction; the screening results are handed to Scrapy to fetch the data. The websites include government websites or other partner websites.
The visual crawling process includes:
S1. quickly analyzing and extracting the website data visually, by means similar to "Boston ivy"-style point-and-click visual scrapers;
S2. adjusting the collected data to the data structure the user needs, according to the user's requirements;
S3. receiving the user's preliminary filtering operations on the collection results through the visual interface, together with the crawled website data.
The distributed cooperative crawling process includes:
S1. counting the website data and distributing it to each crawler;
S2. building distributed multi-node crawlers with Scrapy's scrapy-redis component to crawl the website data;
S3. summarizing the data of the multi-node crawlers.
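The scrapy-redis component coordinates multiple spider nodes through a shared Redis queue of pending URLs, so each node atomically pops the next URL and the results are merged afterwards. A minimal stdlib sketch of that idea follows (no Scrapy or Redis required; an in-process queue stands in for the shared Redis list, and `fetch` is a placeholder for the real download):

```python
import queue
import threading

def distributed_crawl(urls, fetch, num_nodes=3):
    """Simulate scrapy-redis style cooperation: nodes share one URL queue."""
    pending = queue.Queue()              # stands in for the shared Redis list
    for url in urls:
        pending.put(url)                 # S1: distribute work to the crawlers
    results, lock = [], threading.Lock()

    def node(node_id):
        while True:
            try:
                url = pending.get_nowait()   # atomic pop, like Redis LPOP
            except queue.Empty:
                return
            page = fetch(url)                # placeholder for the real download
            with lock:
                results.append((node_id, url, page))

    workers = [threading.Thread(target=node, args=(i,)) for i in range(num_nodes)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results                           # S3: summarized multi-node output

# Usage with a fake fetcher standing in for an HTTP request:
pages = distributed_crawl(["/a", "/b", "/c", "/d"], fetch=lambda u: "html of " + u)
```

Because every node draws from the same queue, no URL is crawled twice regardless of how the work is split across nodes, which is the property the shared Redis queue provides in the real setup.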
The data cleaning module cleans the data using Python cleaning code together with user-defined cleaning rules. The cleaning rules include data screening and data replacement.
Data screening screens out incomplete data, wrong data, and repeated data. The cleaning process for incomplete data judges, according to the rules, whether the important fields are present; the cleaning process for wrong data judges data formats and data boundaries according to the rules; the cleaning process for repeated data judges whether the similarity matches, and then matches according to keyword weights in the rules.
The screened-out data are handled by rules defining an alternative value: either discarded or replaced. Data replacement classifies the screened-out data by screening type and receives the user's processing-operation instructions (user-defined cleaning rules) for the screened-out data, saving the user's processing-operation mode (the user-defined cleaning rule) as a scheme to recommend to the user next time. The processing-operation instructions include selecting an alternative value for handling. The cleaned data are then handed to the data classification module.
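A minimal sketch of the screening-and-replacement step described above, under assumed shapes: the record fields, the `required` / `valid` / `key` rule keys, and the saved-scheme dict are all illustrative, not the patent's actual interfaces.

```python
def clean(records, rules, replacements=None, saved_schemes=None):
    """Screen incomplete / wrong / repeated records, then replace or discard."""
    replacements = replacements or {}
    saved_schemes = saved_schemes if saved_schemes is not None else {}
    seen, kept = set(), []
    for rec in records:
        # incomplete data: important fields must be present and non-empty
        if any(not rec.get(f) for f in rules["required"]):
            field = next(f for f in rules["required"] if not rec.get(f))
            if field in replacements:        # user chose an alternative value
                rec = {**rec, field: replacements[field]}
                saved_schemes[field] = replacements[field]  # recommend next time
            else:
                continue                     # no scheme saved: discard
        # wrong data: format / boundary checks supplied by the rules
        if not all(check(rec) for check in rules.get("valid", [])):
            continue
        # repeated data: deduplicate on the rule's key fields
        key = tuple(rec[f] for f in rules["key"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept

rows = [{"id": 1, "title": "notice"}, {"id": 1, "title": "notice"},
        {"id": 2, "title": ""}, {"id": 3, "title": "report"}]
rules = {"required": ["title"], "key": ["id"],
         "valid": [lambda r: isinstance(r["id"], int) and r["id"] > 0]}
cleaned = clean(rows, rules, replacements={"title": "(untitled)"})
```

Here the duplicate of record 1 is screened out, and the record with the empty title is repaired with the user's alternative value rather than dropped, mirroring the discard-or-replace choice in the text.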
The data classification module uses Python. Its classification process includes:
S1. first performing a preliminary classification according to the data sources (data provider, data use) and the association relationships between data sources (user-defined bindings of the provider-and-use relationships);
S2. after classifying by data source, classifying according to the similarity of data formats;
S3. subdividing according to user-defined classification rules;
S4. comparing the data before and after subdivision, adjusting to reduce error, and repeating until the error falls within the permitted range, after which no further adjustment is needed;
S5. increasing the weight of the operation type of a subdivision classification rule each time it is executed, improving the accuracy of classification.
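Steps S1-S3 amount to nested grouping - by source, then by format signature, then by user rule - with S5 keeping a usage weight per rule. A minimal sketch under assumed record shapes (the `source` field and the signature-by-field-names heuristic for format similarity are illustrative):

```python
from collections import Counter, defaultdict

rule_weights = Counter()   # S5: weight per user-defined rule, bumped on each use

def classify(records, user_rules):
    """S1: group by source; S2: group by format signature; S3: subdivide by rule."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for rec in records:
        source = rec.get("source", "unknown")                       # S1
        signature = tuple(sorted(k for k in rec if k != "source"))  # S2
        label = "other"
        for name, predicate in user_rules.items():                  # S3
            if predicate(rec):
                label = name
                rule_weights[name] += 1                             # S5
                break
        tree[source][signature][label].append(rec)
    return tree

recs = [{"source": "gov", "title": "tax notice"},
        {"source": "gov", "title": "road works"},
        {"source": "partner", "title": "tax stats", "year": 2019}]
tree = classify(recs, {"tax": lambda r: "tax" in r.get("title", "")})
```

A fuller version would implement S4 by re-running `classify` with adjusted rules and comparing the two trees until the difference falls within tolerance.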
After the output outline of the classified results is confirmed to be correct, the data are preserved through the mixed Hadoop + MPP + in-memory-database mode, and the data of each node are concentrated using Alibaba's DataWorks. DataWorks, as also promoted by IBM, is a one-stop analysis service based on a big data platform, providing more intelligent services for industrial data management. IBM's DataWorks project is built on Apache Spark and IBM Watson; its core idea is to optimize data-processing speed and ease of use while guaranteeing the robustness of data analysis. DataWorks borrows technologies such as Pixiedust and Brunel, so that a user can visualize data with only a single line of code, thereby starting analysis and automatically loading, profiling, and classifying data to improve data quality. The visualized data can form association and classification patterns, so that users handling the data business can quickly gain new insights.
The sampling process of the data sampling module includes:
S1. establishing expected data models (an expected data model is a model the user establishes according to the actual business);
S2. sampling the classified data of each type using pandas and inputting it into the expected data models selected by the user for training - if several are selected, training in each model and simulating various operations - rejecting unqualified samples;
S3. comparing the rejected unqualified samples with the sample data in the other expected data models, putting each into the model with the higher similarity for training, and discarding it if no expected model is satisfied;
S4. judging, from the data whose training is completed, whether to add expected-data conditions, and returning to S2 to train again;
S5. obtaining the final sampled data, recording it into the storage unit, and sending it to the data statistics module.
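Treating an "expected data model" as an acceptance predicate plus a similarity score makes S2-S3 concrete. A minimal pure-Python sketch under that assumption (the model interface, the scoring functions, and the 0.5 similarity threshold are illustrative, not the patent's actual training procedure):

```python
def sample(records, models):
    """models: name -> (accepts, similarity) callables. Returns name -> samples."""
    accepted = {name: [] for name in models}
    for rec in records:
        placed = False
        for name, (accepts, _) in models.items():   # S2: try the chosen models
            if accepts(rec):
                accepted[name].append(rec)
                placed = True
                break
        if placed:
            continue
        # S3: route a rejected sample to the most similar other model, or discard
        best, score = None, 0.0
        for name, (_, similarity) in models.items():
            s = similarity(rec)
            if s > score:
                best, score = name, s
        if best is not None and score >= 0.5:       # assumed similarity threshold
            accepted[best].append(rec)
    return accepted                                  # S5: final sampled data

models = {
    "tax": (lambda r: r.get("topic") == "tax",
            lambda r: 1.0 if "tax" in r.get("title", "") else 0.0),
    "road": (lambda r: r.get("topic") == "road",
             lambda r: 1.0 if "road" in r.get("title", "") else 0.0),
}
out = sample([{"topic": "tax", "title": "q1"},
              {"topic": "news", "title": "road repair"},
              {"topic": "news", "title": "weather"}], models)
```

The "road repair" record is rejected by both acceptance tests but rescued by its similarity to the road model, while the "weather" record satisfies no model and is discarded, as S3 prescribes.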
pandas (the Python Data Analysis Library) is a tool based on NumPy, created to solve data-analysis tasks. pandas incorporates a large number of libraries and some standard data models, and provides the tools needed to operate on large data sets efficiently. pandas provides functions and methods that let us handle data quickly and conveniently.
Data statistics is likewise implemented with Python's pandas. pandas has two very important data structures: the series (Series) and the data frame (DataFrame). A Series is similar to a one-dimensional array in numpy: besides covering all the functions and methods available to one-dimensional arrays, it can retrieve data by index label and has automatic index alignment. A DataFrame is similar to a two-dimensional array in numpy and can likewise use the functions and methods of general numpy arrays. They are mainly used to realize the statistical analysis of the various data in the system: which data enterprises expect to obtain from the statistical system; generating reports from the data that enterprises subscribe to and use, according to the data entered into the system; counting which data receive high attention and which are unpopular; and sending the statistical data to the different subscribers. The statistics process of the data statistics module includes:
S1. counting the data users expect to obtain in the system;
S2. generating reports from the generated data according to the user subscriptions to the data recorded in the storage unit;
S3. obtaining data ranked by the attention of different users and handing them to the data pushing module.
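A minimal pandas sketch of S1 and S3 - counting which topics users ask for and ranking them by attention. The subscription-log layout is an assumption made for illustration:

```python
import pandas as pd

# assumed subscription log: one row per user request for a data topic
subs = pd.DataFrame({
    "user":  ["u1", "u2", "u1", "u3", "u2", "u1"],
    "topic": ["tax", "tax", "road", "tax", "road", "tax"],
})

# S1: count expected data per topic; the result is already ranked by attention (S3)
attention = subs["topic"].value_counts()

# which topics to push to each individual user
per_user = subs.groupby("user")["topic"].agg(list)

print(attention.index[0])   # the topic with the highest attention
```

`value_counts` returns counts sorted in descending order, so its index order is exactly the attention ranking the pushing module needs.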
The data pushing module pushes the data that users or enterprises subscribe to, making it convenient for enterprises to obtain data: it informs enterprises when their data have been updated, informs the data sources of data-usage conditions, and informs the data sources of the data enterprises expect so as to facilitate decisions. Enterprises obtain data and give feedback; data sources receive the feedback on data acquisition to facilitate their own decisions - forming a closed data loop.
Besides the core data processing above, the platform provides management functions that help the data operate better.
Security system: a complete authorization mechanism based on OAuth2 is used to manage clients better and guarantee data safety. OAuth 2.0 is the continuation of the OAuth protocol, but it is not backward compatible with OAuth 1.0, which it completely supersedes. OAuth 2.0 focuses on simplicity for client developers: it organizes an approved authorization interaction between the resource owner and the HTTP service provider, acting on behalf of the user or allowing a third-party application to obtain access permission on the user's behalf. At the same time it provides dedicated authorization flows for web applications, desktop applications, mobile phones, and living-room devices.
Resource monitoring: a customized JMeter monitors the usage of every resource server. Apache JMeter is a Java-based load-testing tool for stress-testing software. It was originally designed for testing web applications but has since expanded to other testing fields. It can test static and dynamic resources - static files, Java servlets, CGI scripts, Java objects, databases and queries, FTP servers, and so on. JMeter can simulate huge loads on a server, network, or object to test their strength and analyze overall performance under different pressure categories. In addition, JMeter can perform functional/regression testing of applications, verifying that your program returns the expected results through scripts with assertions; for maximum flexibility, JMeter allows assertions to be created with regular expressions. It can also be used for performance profiling, or for testing your server, script, or object under large concurrent loads.
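The kind of concurrent load test JMeter automates can be sketched in a few lines of Python: fire N requests at a target callable from a pool of workers and collect per-request latencies, with a JMeter-style assertion on the responses. The target here is a stub standing in for a real HTTP call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(target, requests=20, concurrency=5):
    """Run `target` under concurrent load; return a small latency report."""
    def timed(i):
        start = time.perf_counter()
        result = target(i)                   # in a real test: an HTTP GET
        return time.perf_counter() - start, result

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        timings = list(pool.map(timed, range(requests)))
    latencies = [t for t, _ in timings]
    return {
        "count": len(latencies),
        "max": max(latencies),
        "mean": sum(latencies) / len(latencies),
        "ok": all(r == "ok" for _, r in timings),  # JMeter-style assertion
    }

report = load_test(lambda i: "ok")   # stub target; always responds "ok"
```

This only illustrates the shape of the measurement; JMeter additionally provides ramp-up schedules, listeners, and distributed load generation that a sketch like this omits.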
Log system: distributed log collection with ELK records system error logs, request logs, data logs, operation logs, and so on in detail. ELK is ElasticSearch + Logstash + Kibana. ElasticSearch is an open-source distributed search server based on Lucene. Its features include: distributed operation, zero configuration, automatic discovery, automatic index sharding, an index replication mechanism, a RESTful-style interface, multiple data sources, automatic search load balancing, and more. It provides a distributed, multi-tenant full-text search engine based on a RESTful web interface. Elasticsearch is developed in Java and released as open source under the Apache license; it is the second most popular enterprise search engine. Designed for cloud computing, it achieves real-time search and is stable, reliable, fast, and easy to install and use. Logstash is a fully open-source tool that can collect, filter, and analyze your logs; it supports a large number of data-acquisition methods and stores the logs for later use (such as searching). Speaking of search, Logstash comes with a web interface for searching and displaying all logs. The general working mode is a client/server architecture: the client is installed on the hosts whose logs need collecting, and the server receives the logs of each node, filters and modifies them, and forwards them together to Elasticsearch. Kibana is a browser-based front-end display tool for Elasticsearch - an open-source, free tool that provides a log-analysis-friendly web interface for Logstash and ElasticSearch and can help you summarize, analyze, and search important data logs.
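Logs are easiest for a Logstash/ELK pipeline to ingest when each line is a self-describing JSON document. A minimal stdlib sketch of such a formatter follows; the field names are an illustrative convention, not an ELK requirement:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line that Logstash can parse directly."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "timestamp": self.formatTime(record),
        })

logger = logging.getLogger("platform.crawler")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("crawled %d pages", 42)   # emits a single JSON log line
```

Keeping one JSON object per line means Logstash needs no custom grok pattern: its `json` codec can index each field (level, logger, message) directly into Elasticsearch for searching in Kibana.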
System monitoring: monitoring the working state, health state, and running condition of the system on all nodes and at all levels, and taking the corresponding action on what happens during running.
Task monitoring: monitoring and managing the execution of tasks such as synchronous data exchange.
Message notification: delivering the various state notifications produced while the system runs to the different users.
System early warning: giving early warning based on judgements about the memory, network, capacity, and other aspects of the environment the system depends on during operation, guaranteeing a normal running environment for the system.
System tracking: tracking any sensitive operation occurring in the system and preserving the reliability of its source.
Intelligent guiding: recommending, and step by step guiding, the user to operate the system correctly and efficiently according to the user's data situation.
The crawler analysis platform based on government-affairs big data provided by the present invention has been described in detail above. Specific examples are used herein to illustrate the principle and implementation of the invention; the explanation of the above embodiments is only intended to help in understanding the method of the invention and its core idea. At the same time, for those skilled in the art, there will be changes in the specific implementation and application scope according to the idea of the present invention. In conclusion, the contents of this specification should not be construed as limiting the invention.
Herein, relational terms such as first and second are used merely to distinguish one entity or operation from another entity or operation, without necessarily requiring or implying any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In the absence of further restrictions, an element limited by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
Claims (7)
1. A crawler analysis platform based on government-affairs big data, characterized by comprising: a data acquisition module, a data cleaning module, a data classification module, a data sampling module, a data statistics module, and a data push module;
wherein the data acquisition module is used for crawling website data;
the data acquisition module is connected to the data cleaning module and hands the crawled website data to the data cleaning module, which cleans the data according to recommended cleaning rules;
the data cleaning module is connected to the data classification module, which performs multi-level classification on the cleaned data;
the data classification module is connected to the data sampling module, which samples the classified data according to an expected data model;
the data sampling module is connected to the data statistics module, which performs statistics on the sampled data to obtain data ranked by user attention;
the data statistics module is connected to the data push module, which pushes its data to users according to their attention, and pushes the users' expected data in the expected data model back to the website data sources, forming a closed data loop.
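The module chain of claim 1 can be illustrated as a simple pipeline. This is a minimal sketch, not the patented implementation; every function name here is a hypothetical stand-in for one of the claimed modules.

```python
# Hypothetical sketch of the module chain in claim 1: acquisition -> cleaning
# -> classification -> sampling -> statistics. Each stage is a plain function;
# the real modules would be connected services as the claim describes.

def acquire():                     # data acquisition module: crawl website data
    return [{"source": "gov", "text": "policy A"},
            {"source": "gov", "text": ""}]        # one incomplete record

def clean(records):                # data cleaning module: drop incomplete rows
    return [r for r in records if r["text"]]

def classify(records):             # data classification module: group by source
    groups = {}
    for r in records:
        groups.setdefault(r["source"], []).append(r)
    return groups

def sample(groups, expected_source):   # data sampling module: keep expected data
    return groups.get(expected_source, [])

def run_pipeline():
    stats = {"count": len(sample(classify(clean(acquire())), "gov"))}
    return stats                   # statistics module output, ready to push

print(run_pipeline())              # -> {'count': 1}
```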
2. The crawler analysis platform based on government-affairs big data according to claim 1, characterized in that the data acquisition module uses the Scrapy framework to perform visual crawling and/or distributed cooperative crawling of website data, the websites including government websites or other partner websites;
wherein the visual crawling process includes:
S1. parsing the website data online, generating a data topology graph, and automatically deriving collection rules from the user's operations;
S2. converting the data structure of the collected data;
S3. receiving, through the visual interface, the user's preliminary filtering of the collection results, together with the crawled website data;
and the distributed cooperative crawling process includes:
S1. counting the website data and distributing it to each crawler;
S2. crawling the website data with cooperating multi-node crawlers;
S3. aggregating the multi-node crawler data.
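The distributed cooperative crawl of claim 2 (S1 distribute, S2 crawl per node, S3 aggregate) can be sketched as follows. The patent names the Scrapy framework but publishes no code, so plain functions stand in for crawler nodes here; the URLs and the round-robin split are illustrative assumptions.

```python
# Hypothetical sketch of the distributed cooperative crawl in claim 2.

def distribute(urls, n_nodes):
    """S1: split the URL list round-robin across n crawler nodes."""
    return [urls[i::n_nodes] for i in range(n_nodes)]

def crawl_node(urls):
    """S2: each node fetches its share (fetching stubbed out)."""
    return [{"url": u, "html": "<html>...</html>"} for u in urls]

def aggregate(node_results):
    """S3: merge the per-node results into one data set."""
    return [page for result in node_results for page in result]

urls = [f"https://example.gov/page{i}" for i in range(5)]
shards = distribute(urls, 2)
pages = aggregate([crawl_node(s) for s in shards])
print(len(pages))   # -> 5
```

In a real deployment each shard would be handed to a separate Scrapy crawler process, with aggregation done through shared storage or a message queue.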
3. The crawler analysis platform based on government-affairs big data according to claim 1, characterized in that the cleaning rules of the data cleaning module include data screening and data replacement;
the data screening screens out incomplete data, erroneous data, and duplicate data;
the data replacement classifies the screened data by screening type, receives the user's processing instructions for the screened-out data, and saves the user's processing operations as a scheme recommended for the next use, the processing instructions including selecting an alternative scheme for handling; the cleaned data is then handed to the data classification module.
4. The crawler analysis platform based on government-affairs big data according to claim 1, characterized in that the classification process of the data classification module includes:
S1. first performing a preliminary classification according to the data sources and the relationships among them;
S2. after classifying by data source, classifying by similarity of data format;
S3. subdividing according to user-defined classification rules;
S4. comparing the data before and after subdivision and adjusting accordingly;
S5. increasing the weight of an operation type each time its subdivision classification rule is executed.
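Steps S1, S3, and S5 of claim 4 can be sketched with user-defined rules as predicates. This is an assumed minimal form: the format-similarity step (S2) and the before/after comparison (S4) are omitted, and the rule names and record fields are illustrative.

```python
# Hypothetical sketch of the classification process in claim 4: classify by
# data source first (S1), subdivide with user-defined rules (S3), and raise
# each rule's weight every time it fires (S5).

from collections import defaultdict

def classify(records, user_rules):
    rule_weights = defaultdict(int)
    groups = defaultdict(list)
    for r in records:
        bucket = r["source"]                  # S1: preliminary class by source
        for name, rule in user_rules.items():
            if rule(r):                       # S3: user-defined subdivision
                bucket = f"{bucket}/{name}"
                rule_weights[name] += 1       # S5: weight per executed rule
        groups[bucket].append(r)
    return dict(groups), dict(rule_weights)

rules = {"notice": lambda r: "notice" in r["title"]}
data = [{"source": "gov", "title": "tax notice"},
        {"source": "gov", "title": "annual report"}]
groups, weights = classify(data, rules)
print(sorted(groups), weights)   # -> ['gov', 'gov/notice'] {'notice': 1}
```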
5. The crawler analysis platform based on government-affairs big data according to claim 1, characterized in that the sampling process of the data sampling module includes:
S1. establishing an expected data model;
S2. feeding the classified data into the corresponding expected data model for training, and rejecting unqualified samples;
S3. choosing whether to feed the rejected data into other expected data models for training;
S4. deciding, from the trained data, whether to add expected-data conditions, and returning to S2 for retraining;
S5. obtaining the final sampled data, recording it in the storage unit, and sending it to the data statistics module.
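The sampling loop of claim 5 can be sketched by representing each "expected data model" as a simple predicate. This is an assumption made for illustration; the patent's models are trained, whereas here acceptance is a fixed test, and the retry against other models (S3) is the fallthrough over the model dictionary.

```python
# Hypothetical sketch of the sampling process in claim 5: samples that fail
# one expected data model (S2) are tried against the others (S3); anything
# no model accepts is rejected as unqualified.

def sample(records, models):
    accepted = {name: [] for name in models}
    rejected = []
    for r in records:
        for name, model in models.items():   # S2/S3: try each expected model
            if model(r):
                accepted[name].append(r)
                break
        else:
            rejected.append(r)               # unqualified sample
    return accepted, rejected

models = {"policy": lambda r: r["kind"] == "policy",
          "news":   lambda r: r["kind"] == "news"}
data = [{"kind": "policy"}, {"kind": "news"}, {"kind": "ad"}]
accepted, rejected = sample(data, models)
print(len(accepted["policy"]), len(accepted["news"]), len(rejected))  # -> 1 1 1
```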
6. The crawler analysis platform based on government-affairs big data according to claim 1, characterized in that the statistics process of the data statistics module includes:
S1. obtaining and counting the users' expected data;
S2. having users subscribe to the data recorded in the storage unit, and generating reports from the resulting data;
S3. obtaining data ranked by user attention, and handing it to the data push module.
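The attention statistics of claim 6 can be sketched as counting topics over user subscriptions. The subscription tuples are an assumed minimal representation of "data recorded in the storage unit".

```python
# Hypothetical sketch of the statistics process in claim 6: count how often
# each topic appears in users' subscriptions to estimate attention, then rank.

from collections import Counter

def rank_by_attention(subscriptions):
    """subscriptions: list of (user, topic) pairs from the storage unit."""
    attention = Counter(topic for _, topic in subscriptions)
    return attention.most_common()           # data ordered by user attention

subs = [("u1", "tax"), ("u2", "tax"), ("u3", "housing")]
print(rank_by_attention(subs))   # -> [('tax', 2), ('housing', 1)]
```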
7. The crawler analysis platform based on government-affairs big data according to claim 1, characterized in that the data push module delivers the data produced by the data statistics module to different users and to the source websites, forming a closed data loop.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910364873.3A CN110175280A (en) | 2019-04-30 | 2019-04-30 | A kind of crawler analysis platform based on government affairs big data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910364873.3A CN110175280A (en) | 2019-04-30 | 2019-04-30 | A kind of crawler analysis platform based on government affairs big data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110175280A true CN110175280A (en) | 2019-08-27 |
Family
ID=67690445
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910364873.3A Pending CN110175280A (en) | 2019-04-30 | 2019-04-30 | A kind of crawler analysis platform based on government affairs big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110175280A (en) |
- 2019-04-30: CN application CN201910364873.3A filed; published as CN110175280A (status: Pending)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106294390A (en) * | 2015-05-20 | 2017-01-04 | 上海纳鑫信息科技有限公司 | A kind of data mining analysis method and system |
CN106855962A (en) * | 2015-12-09 | 2017-06-16 | 星际空间(天津)科技发展有限公司 | A kind of method for building government affairs big data platform |
CN109213869A (en) * | 2017-06-29 | 2019-01-15 | 中国科学技术大学 | Hot spot technology prediction technique based on multi-source data |
CN109636352A (en) * | 2018-12-20 | 2019-04-16 | 湖南晖龙集团股份有限公司 | A kind of distributed content duplicate checking early warning system based on financial big data |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110765337A (en) * | 2019-11-15 | 2020-02-07 | 中科院计算技术研究所大数据研究院 | Service providing method based on internet big data |
CN110765337B (en) * | 2019-11-15 | 2021-04-06 | 中科院计算技术研究所大数据研究院 | Service providing method based on internet big data |
CN111159188B (en) * | 2019-12-28 | 2023-05-09 | 北京慧博科技有限公司 | Processing method for realizing quasi-real-time large data volume based on DataWorks |
CN117076773A (en) * | 2023-08-23 | 2023-11-17 | 上海兰桂骐技术发展股份有限公司 | Data source screening and optimizing method based on internet information |
CN117076773B (en) * | 2023-08-23 | 2024-05-28 | 上海兰桂骐技术发展股份有限公司 | Data source screening and optimizing method based on internet information |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111435344B (en) | Big data-based drilling acceleration influence factor analysis model | |
Xiao et al. | Design rule spaces: A new form of architecture insight | |
Zaiane et al. | Towards evaluating learners' behaviour in a web-based distance learning environment | |
Koru et al. | An investigation of the effect of module size on defect prediction using static measures | |
CN101743542B (en) | Collecting and presenting temporal-based action information | |
CN104077402B (en) | Data processing method and data handling system | |
US20070143842A1 (en) | Method and system for acquisition and centralized storage of event logs from disparate systems | |
Linn et al. | Desktop activity mining-a new level of detail in mining business processes | |
CN104715047A (en) | Social network data collecting and analyzing system | |
CN106095965A (en) | A kind of data processing method and device | |
Hughes et al. | Designing an application for social media needs in emergency public information work | |
CN106209455A (en) | The associated services Fault Locating Method of a kind of cross-system weak coupling and system | |
Martino et al. | Temporal outlier analysis of online civil trial cases based on graph and process mining techniques | |
Ahmed et al. | Centralized log management using elasticsearch, logstash and kibana | |
CN110175280A (en) | A kind of crawler analysis platform based on government affairs big data | |
CN114399205A (en) | Procedural evaluation method, system and equipment suitable for project collaboration | |
CN109710667A (en) | A kind of shared realization method and system of the multisource data fusion based on big data platform | |
Anderson et al. | Architectural Implications of Social Media Analytics in Support of Crisis Informatics Research. | |
CN109446441A (en) | A kind of credible distributed capture storage system of general Web Community | |
CN110941836A (en) | Distributed vertical crawler method and terminal equipment | |
Slaninová et al. | From Moodle log file to the students network | |
Xu et al. | The application of web crawler in city image research | |
Lashari et al. | Monitoring public opinion by measuring the sentiment of retweets on Twitter | |
Das et al. | Popularity analysis on social network: a big data analysis | |
Dong | A Study on the Construction of Human Resources Audit Management Platform Based on Big Data |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190827 |