CN110175280A - A crawler analysis platform based on government-affairs big data - Google Patents
- Publication number
- CN110175280A (application CN201910364873.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- module
- user
- crawler
- website
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9532—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
Abstract
The invention discloses a crawler analysis platform based on government-affairs big data. The platform supports distributed cooperative crawling, visual crawling, and intelligent analysis of user behavior to quickly screen out the data a user needs; for distributed cooperative crawling, adding enough cluster nodes greatly improves the processing capacity of the system. Intelligent screening analysis supports user-defined data-screening rules and analyzes how users screen data, making the screening increasingly accurate. The platform provides efficient data-processing capability together with rich custom rules and intelligent behavior analysis, helping enterprises acquire the government data they need. At the same time it completes the interaction and feedback between users and government sources, forming a closed data loop and making it far easier to connect to government data.
Description
Technical field
The present invention relates to the technical field of big data platforms, and in particular to a crawler analysis platform based on government-affairs big data.
Background art
Although big data technology is currently very popular, China's big data industry is still in its infancy and its industrial chain is immature. After a big data industrial park is established, there may not be enough enterprises entering it to form a complete big data ecosystem.
Existing big data application technology also has safety and latent problems. First, there are the threats faced by big data itself: the often-mentioned security problem that, once big data technologies, systems, and applications have accumulated a large amount of value, they inevitably become targets of attack. Second, there are the problems and side effects brought by the excessive abuse of big data, the most typical being leakage of personal privacy, as well as leakage of business secrets and state secrets enabled by big data analysis capability. Third, there are safety problems of intelligence and awareness. Threats to big data, the side effects of big data, and the extremes of big data intelligence can all hinder and damage the development of big data.
Therefore, how to provide a big data platform that is safe and realizes efficient utilization of government data is a problem urgently to be solved by those skilled in the art.
Summary of the invention
The present invention provides a crawler analysis platform based on government-affairs big data. Building on advanced big data technology and addressing the shortcomings of the prior art, it expands the coverage of big data application technology so that more users can apply government-affairs big data technology, realizes efficient utilization of government data through big data application technology, and at the same time ensures the safety of government data, allowing government data to form a more complete big data ecosystem. The concrete scheme is as follows:
A crawler analysis platform based on government-affairs big data includes a data acquisition module, a data cleaning module, a data classification module, a data sampling module, a data statistics module, and a data pushing module;
wherein the data acquisition module is used for crawling website data;
the data acquisition module is connected with the data cleaning module and hands the crawled website data to the data cleaning module, which performs data cleaning according to recommended cleaning rules;
the data cleaning module is connected with the data classification module, and the data cleaned by the data cleaning module undergo multiple rounds of classification processing;
the data classification module is connected with the data sampling module, and the data classified by the data classification module are sampled according to expected data models;
the data sampling module is connected with the data statistics module, and the data sampled by the data sampling module are counted to obtain data ranked by the attention of different users;
the data statistics module is connected with the data pushing module, the data of the data statistics module are pushed to users according to user attention, and the users' expected data in the expected data models are pushed back to the website data sources, forming a closed data loop.
Preferably, the data acquisition module uses the Scrapy framework and performs visual crawling and/or distributed cooperative crawling of website data; the websites include government websites or other partner websites.
The visual crawling process includes:
S1. analyzing the website data online, generating a data topology graph, and automatically deriving collection rules from the user's operations;
S2. converting the structure of the collected data;
S3. receiving the user's preliminary filtering operations on the collection results through the visual interface, together with the crawled website data.
The distributed cooperative crawling process includes:
S1. counting the website data and distributing it to each crawler;
S2. crawling the website data with cooperating multi-node crawlers;
S3. summarizing the data of the multi-node crawlers.
Preferably, the cleaning rules of the data cleaning module include data screening and data replacement;
the data screening screens out incomplete data, wrong data, and repeated data;
the data replacement classifies the screened-out data by screening type, receives the user's processing-operation instructions for the screened-out data, and saves the user's processing-operation mode as a scheme to recommend to the user next time; the processing-operation instructions include selecting an alternative value for handling; the cleaned data are then handed to the data classification module.
Preferably, the classification process of the data classification module includes:
S1. first performing a preliminary classification according to the data sources and the association relationships between data sources;
S2. after classifying by data source, classifying according to the similarity of data formats;
S3. subdividing according to user-defined classification rules;
S4. comparing the data before and after subdivision and adjusting accordingly;
S5. increasing the weight of the operation type of a subdivision classification rule each time it is executed.
Preferably, the sampling process of the data sampling module includes:
S1. establishing expected data models;
S2. inputting the classified data into the corresponding expected data models for training, and rejecting unqualified samples;
S3. choosing whether to input the rejected data into other expected data models for training;
S4. judging, from the data whose training is completed, whether to add expected-data conditions, and returning to S2 to train again;
S5. obtaining the final sampled data, recording it into the storage unit, and sending it to the data statistics module.
Preferably, the statistics process of the data statistics module includes:
S1. obtaining and counting the users' expected data;
S2. generating reports from the generated data according to the user subscriptions to the data recorded in the storage unit;
S3. obtaining data ranked by the attention of different users and handing them to the data pushing module.
Preferably, the data pushing module delivers the data produced by the data statistics module to different users and to the source websites, forming a closed data loop.
Compared with the prior art, the present invention has the following advantages:
The present disclosure provides a crawler analysis platform based on government-affairs big data. Building on advanced big data technology and addressing the shortcomings of the prior art, it uses the mixed Hadoop + MPP + in-memory-database mode as its technical architecture while using Python to support the acquisition and computation of real-time data, realizing a highly concurrent, scalable, high-performance big data system. It expands the coverage of big data application technology so that more enterprises can use government-affairs big data technology; the resulting big data platform applied to government data can realize efficient utilization of government data through big data application technology while also keeping government data highly safe, allowing government data to form a more complete big data ecosystem.
Brief description of the drawings
In order to explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the invention; for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a structural framework diagram of a crawler analysis platform based on government-affairs big data according to the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below in combination with the drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
This embodiment provides a crawler analysis platform based on government-affairs big data. Referring to the structural framework diagram disclosed in Fig. 1, it includes a data acquisition module, a data cleaning module, a data classification module, a data sampling module, a data statistics module, and a data pushing module. The data acquisition module is used for crawling website data. The data acquisition module is connected with the data cleaning module and hands the crawled website data to the data cleaning module, which performs data cleaning according to recommended cleaning rules. The data cleaning module is connected with the data classification module, and the cleaned data undergo multiple rounds of classification processing. The data classification module is connected with the data sampling module, and the classified data are sampled according to expected data models. The data sampling module is connected with the data statistics module, and the sampled data are counted to obtain data ranked by the attention of different users. The data statistics module is connected with the data pushing module; the data of the data statistics module are pushed to users according to user attention, and the users' expected data in the expected data models are pushed back to the website data sources, forming a closed data loop.
Specifically, the data acquisition module uses the Python Scrapy framework. Scrapy is an application framework written for crawling website data and extracting structured data; it can be applied in a series of programs for data mining, information processing, or storing historical data. It was originally designed for page scraping (more precisely, web scraping), for obtaining data returned by APIs (such as Amazon Associates Web Services), or for general web crawlers. Scrapy is widely used and can serve data mining, monitoring, and automated testing. The data acquisition module combines distributed cooperative crawling over the website data with user-defined, in-depth visual page parsing for fast data extraction; the screening results are handed to Scrapy to fetch the data. The websites include government websites or other partner websites.
The visual crawling process includes:
S1. quickly analyzing and extracting the website data visually, by means similar to "Boston ivy"-style point-and-click visual scrapers;
S2. adjusting the collected data to the data structure the user needs, according to the user's requirements;
S3. receiving the user's preliminary filtering operations on the collection results through the visual interface, together with the crawled website data.
The distributed cooperative crawling process includes:
S1. counting the website data and distributing it to each crawler;
S2. building distributed multi-node crawlers with Scrapy's scrapy-redis component to crawl the website data;
S3. summarizing the data of the multi-node crawlers.
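The scrapy-redis component coordinates multiple spider nodes through a shared Redis queue of pending URLs, so each node atomically pops the next URL and the results are merged afterwards. A minimal stdlib sketch of that idea follows (no Scrapy or Redis required; an in-process queue stands in for the shared Redis list, and `fetch` is a placeholder for the real download):

```python
import queue
import threading

def distributed_crawl(urls, fetch, num_nodes=3):
    """Simulate scrapy-redis style cooperation: nodes share one URL queue."""
    pending = queue.Queue()              # stands in for the shared Redis list
    for url in urls:
        pending.put(url)                 # S1: distribute work to the crawlers
    results, lock = [], threading.Lock()

    def node(node_id):
        while True:
            try:
                url = pending.get_nowait()   # atomic pop, like Redis LPOP
            except queue.Empty:
                return
            page = fetch(url)                # placeholder for the real download
            with lock:
                results.append((node_id, url, page))

    workers = [threading.Thread(target=node, args=(i,)) for i in range(num_nodes)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results                           # S3: summarized multi-node output

# Usage with a fake fetcher standing in for an HTTP request:
pages = distributed_crawl(["/a", "/b", "/c", "/d"], fetch=lambda u: "html of " + u)
```

Because every node draws from the same queue, no URL is crawled twice regardless of how the work is split across nodes, which is the property the shared Redis queue provides in the real setup.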
The data cleaning module cleans the data using Python cleaning code together with user-defined cleaning rules. The cleaning rules include data screening and data replacement.
Data screening screens out incomplete data, wrong data, and repeated data. The cleaning process for incomplete data judges, according to the rules, whether the important fields are present; the cleaning process for wrong data judges data formats and data boundaries according to the rules; the cleaning process for repeated data judges whether the similarity matches, and then matches according to keyword weights in the rules.
The screened-out data are handled by rules defining an alternative value: either discarded or replaced. Data replacement classifies the screened-out data by screening type and receives the user's processing-operation instructions (user-defined cleaning rules) for the screened-out data, saving the user's processing-operation mode (the user-defined cleaning rule) as a scheme to recommend to the user next time. The processing-operation instructions include selecting an alternative value for handling. The cleaned data are then handed to the data classification module.
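A minimal sketch of the screening-and-replacement step described above, under assumed shapes: the record fields, the `required` / `valid` / `key` rule keys, and the saved-scheme dict are all illustrative, not the patent's actual interfaces.

```python
def clean(records, rules, replacements=None, saved_schemes=None):
    """Screen incomplete / wrong / repeated records, then replace or discard."""
    replacements = replacements or {}
    saved_schemes = saved_schemes if saved_schemes is not None else {}
    seen, kept = set(), []
    for rec in records:
        # incomplete data: important fields must be present and non-empty
        if any(not rec.get(f) for f in rules["required"]):
            field = next(f for f in rules["required"] if not rec.get(f))
            if field in replacements:        # user chose an alternative value
                rec = {**rec, field: replacements[field]}
                saved_schemes[field] = replacements[field]  # recommend next time
            else:
                continue                     # no scheme saved: discard
        # wrong data: format / boundary checks supplied by the rules
        if not all(check(rec) for check in rules.get("valid", [])):
            continue
        # repeated data: deduplicate on the rule's key fields
        key = tuple(rec[f] for f in rules["key"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept

rows = [{"id": 1, "title": "notice"}, {"id": 1, "title": "notice"},
        {"id": 2, "title": ""}, {"id": 3, "title": "report"}]
rules = {"required": ["title"], "key": ["id"],
         "valid": [lambda r: isinstance(r["id"], int) and r["id"] > 0]}
cleaned = clean(rows, rules, replacements={"title": "(untitled)"})
```

Here the duplicate of record 1 is screened out, and the record with the empty title is repaired with the user's alternative value rather than dropped, mirroring the discard-or-replace choice in the text.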
The data classification module uses Python. Its classification process includes:
S1. first performing a preliminary classification according to the data sources (data provider, data use) and the association relationships between data sources (user-defined bindings of the provider-and-use relationships);
S2. after classifying by data source, classifying according to the similarity of data formats;
S3. subdividing according to user-defined classification rules;
S4. comparing the data before and after subdivision, adjusting to reduce error, and repeating until the error falls within the permitted range, after which no further adjustment is needed;
S5. increasing the weight of the operation type of a subdivision classification rule each time it is executed, improving the accuracy of classification.
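Steps S1-S3 amount to nested grouping - by source, then by format signature, then by user rule - with S5 keeping a usage weight per rule. A minimal sketch under assumed record shapes (the `source` field and the signature-by-field-names heuristic for format similarity are illustrative):

```python
from collections import Counter, defaultdict

rule_weights = Counter()   # S5: weight per user-defined rule, bumped on each use

def classify(records, user_rules):
    """S1: group by source; S2: group by format signature; S3: subdivide by rule."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for rec in records:
        source = rec.get("source", "unknown")                       # S1
        signature = tuple(sorted(k for k in rec if k != "source"))  # S2
        label = "other"
        for name, predicate in user_rules.items():                  # S3
            if predicate(rec):
                label = name
                rule_weights[name] += 1                             # S5
                break
        tree[source][signature][label].append(rec)
    return tree

recs = [{"source": "gov", "title": "tax notice"},
        {"source": "gov", "title": "road works"},
        {"source": "partner", "title": "tax stats", "year": 2019}]
tree = classify(recs, {"tax": lambda r: "tax" in r.get("title", "")})
```

A fuller version would implement S4 by re-running `classify` with adjusted rules and comparing the two trees until the difference falls within tolerance.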
After the output outline of the classified results is confirmed to be correct, the data are preserved through the mixed Hadoop + MPP + in-memory-database mode, and the data of each node are concentrated using Alibaba's DataWorks. DataWorks, as also promoted by IBM, is a one-stop analysis service based on a big data platform, providing more intelligent services for industrial data management. IBM's DataWorks project is built on Apache Spark and IBM Watson; its core idea is to optimize data-processing speed and ease of use while guaranteeing the robustness of data analysis. DataWorks borrows technologies such as Pixiedust and Brunel, so that a user can visualize data with only a single line of code, thereby starting analysis and automatically loading, profiling, and classifying data to improve data quality. The visualized data can form association and classification patterns, so that users handling the data business can quickly gain new insights.
The sampling process of the data sampling module includes:
S1. establishing expected data models (an expected data model is a model the user establishes according to the actual business);
S2. sampling the classified data of each type using pandas and inputting it into the expected data models selected by the user for training - if several are selected, training in each model and simulating various operations - rejecting unqualified samples;
S3. comparing the rejected unqualified samples with the sample data in the other expected data models, putting each into the model with the higher similarity for training, and discarding it if no expected model is satisfied;
S4. judging, from the data whose training is completed, whether to add expected-data conditions, and returning to S2 to train again;
S5. obtaining the final sampled data, recording it into the storage unit, and sending it to the data statistics module.
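Treating an "expected data model" as an acceptance predicate plus a similarity score makes S2-S3 concrete. A minimal pure-Python sketch under that assumption (the model interface, the scoring functions, and the 0.5 similarity threshold are illustrative, not the patent's actual training procedure):

```python
def sample(records, models):
    """models: name -> (accepts, similarity) callables. Returns name -> samples."""
    accepted = {name: [] for name in models}
    for rec in records:
        placed = False
        for name, (accepts, _) in models.items():   # S2: try the chosen models
            if accepts(rec):
                accepted[name].append(rec)
                placed = True
                break
        if placed:
            continue
        # S3: route a rejected sample to the most similar other model, or discard
        best, score = None, 0.0
        for name, (_, similarity) in models.items():
            s = similarity(rec)
            if s > score:
                best, score = name, s
        if best is not None and score >= 0.5:       # assumed similarity threshold
            accepted[best].append(rec)
    return accepted                                  # S5: final sampled data

models = {
    "tax": (lambda r: r.get("topic") == "tax",
            lambda r: 1.0 if "tax" in r.get("title", "") else 0.0),
    "road": (lambda r: r.get("topic") == "road",
             lambda r: 1.0 if "road" in r.get("title", "") else 0.0),
}
out = sample([{"topic": "tax", "title": "q1"},
              {"topic": "news", "title": "road repair"},
              {"topic": "news", "title": "weather"}], models)
```

The "road repair" record is rejected by both acceptance tests but rescued by its similarity to the road model, while the "weather" record satisfies no model and is discarded, as S3 prescribes.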
pandas (the Python Data Analysis Library) is a tool based on NumPy, created to solve data-analysis tasks. pandas incorporates a large number of libraries and some standard data models, and provides the tools needed to operate on large data sets efficiently. pandas provides functions and methods that let us handle data quickly and conveniently.
Data statistics is likewise implemented with Python's pandas. pandas has two very important data structures: the series (Series) and the data frame (DataFrame). A Series is similar to a one-dimensional array in numpy: besides covering all the functions and methods available to one-dimensional arrays, it can retrieve data by index label and has automatic index alignment. A DataFrame is similar to a two-dimensional array in numpy and can likewise use the functions and methods of general numpy arrays. They are mainly used to realize the statistical analysis of the various data in the system: which data enterprises expect to obtain from the statistical system; generating reports from the data that enterprises subscribe to and use, according to the data entered into the system; counting which data receive high attention and which are unpopular; and sending the statistical data to the different subscribers. The statistics process of the data statistics module includes:
S1. counting the data users expect to obtain in the system;
S2. generating reports from the generated data according to the user subscriptions to the data recorded in the storage unit;
S3. obtaining data ranked by the attention of different users and handing them to the data pushing module.
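A minimal pandas sketch of S1 and S3 - counting which topics users ask for and ranking them by attention. The subscription-log layout is an assumption made for illustration:

```python
import pandas as pd

# assumed subscription log: one row per user request for a data topic
subs = pd.DataFrame({
    "user":  ["u1", "u2", "u1", "u3", "u2", "u1"],
    "topic": ["tax", "tax", "road", "tax", "road", "tax"],
})

# S1: count expected data per topic; the result is already ranked by attention (S3)
attention = subs["topic"].value_counts()

# which topics to push to each individual user
per_user = subs.groupby("user")["topic"].agg(list)

print(attention.index[0])   # the topic with the highest attention
```

`value_counts` returns counts sorted in descending order, so its index order is exactly the attention ranking the pushing module needs.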
The data pushing module pushes the data that users or enterprises subscribe to, making it convenient for enterprises to obtain data: it informs enterprises when their data have been updated, informs the data sources of data-usage conditions, and informs the data sources of the data enterprises expect so as to facilitate decisions. Enterprises obtain data and give feedback; data sources receive the feedback on data acquisition to facilitate their own decisions - forming a closed data loop.
Besides the core data processing above, the platform provides management functions that help the data operate better.
Security system: a complete authorization mechanism based on OAuth2 is used to manage clients better and guarantee data safety. OAuth 2.0 is the continuation of the OAuth protocol, but it is not backward compatible with OAuth 1.0, which it completely supersedes. OAuth 2.0 focuses on simplicity for client developers: it organizes an approved authorization interaction between the resource owner and the HTTP service provider, acting on behalf of the user or allowing a third-party application to obtain access permission on the user's behalf. At the same time it provides dedicated authorization flows for web applications, desktop applications, mobile phones, and living-room devices.
Resource monitoring: a customized JMeter monitors the usage of every resource server. Apache JMeter is a Java-based load-testing tool for stress-testing software. It was originally designed for testing web applications but has since expanded to other testing fields. It can test static and dynamic resources - static files, Java servlets, CGI scripts, Java objects, databases and queries, FTP servers, and so on. JMeter can simulate huge loads on a server, network, or object to test their strength and analyze overall performance under different pressure categories. In addition, JMeter can perform functional/regression testing of applications, verifying that your program returns the expected results through scripts with assertions; for maximum flexibility, JMeter allows assertions to be created with regular expressions. It can also be used for performance profiling, or for testing your server, script, or object under large concurrent loads.
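The kind of concurrent load test JMeter automates can be sketched in a few lines of Python: fire N requests at a target callable from a pool of workers and collect per-request latencies, with a JMeter-style assertion on the responses. The target here is a stub standing in for a real HTTP call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(target, requests=20, concurrency=5):
    """Run `target` under concurrent load; return a small latency report."""
    def timed(i):
        start = time.perf_counter()
        result = target(i)                   # in a real test: an HTTP GET
        return time.perf_counter() - start, result

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        timings = list(pool.map(timed, range(requests)))
    latencies = [t for t, _ in timings]
    return {
        "count": len(latencies),
        "max": max(latencies),
        "mean": sum(latencies) / len(latencies),
        "ok": all(r == "ok" for _, r in timings),  # JMeter-style assertion
    }

report = load_test(lambda i: "ok")   # stub target; always responds "ok"
```

This only illustrates the shape of the measurement; JMeter additionally provides ramp-up schedules, listeners, and distributed load generation that a sketch like this omits.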
Log system: distributed log collection with ELK records system error logs, request logs, data logs, operation logs, and so on in detail. ELK is ElasticSearch + Logstash + Kibana. ElasticSearch is an open-source distributed search server based on Lucene. Its features include: distributed operation, zero configuration, automatic discovery, automatic index sharding, an index replication mechanism, a RESTful-style interface, multiple data sources, automatic search load balancing, and more. It provides a distributed, multi-tenant full-text search engine based on a RESTful web interface. Elasticsearch is developed in Java and released as open source under the Apache license; it is the second most popular enterprise search engine. Designed for cloud computing, it achieves real-time search and is stable, reliable, fast, and easy to install and use. Logstash is a fully open-source tool that can collect, filter, and analyze your logs; it supports a large number of data-acquisition methods and stores the logs for later use (such as searching). Speaking of search, Logstash comes with a web interface for searching and displaying all logs. The general working mode is a client/server architecture: the client is installed on the hosts whose logs need collecting, and the server receives the logs of each node, filters and modifies them, and forwards them together to Elasticsearch. Kibana is a browser-based front-end display tool for Elasticsearch - an open-source, free tool that provides a log-analysis-friendly web interface for Logstash and ElasticSearch and can help you summarize, analyze, and search important data logs.
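Logs are easiest for a Logstash/ELK pipeline to ingest when each line is a self-describing JSON document. A minimal stdlib sketch of such a formatter follows; the field names are an illustrative convention, not an ELK requirement:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line that Logstash can parse directly."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "timestamp": self.formatTime(record),
        })

logger = logging.getLogger("platform.crawler")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("crawled %d pages", 42)   # emits a single JSON log line
```

Keeping one JSON object per line means Logstash needs no custom grok pattern: its `json` codec can index each field (level, logger, message) directly into Elasticsearch for searching in Kibana.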
System monitoring: monitoring the working state, health state, and running condition of the system on all nodes and at all levels, and taking the corresponding action on what happens during running.
Task monitoring: monitoring and managing the execution of tasks such as synchronous data exchange.
Message notification: delivering the various state notifications produced while the system runs to the different users.
System early warning: giving early warning based on judgements about the memory, network, capacity, and other aspects of the environment the system depends on during operation, guaranteeing a normal running environment for the system.
System tracking: tracking any sensitive operation occurring in the system and preserving the reliability of its source.
Intelligent guiding: recommending, and step by step guiding, the user to operate the system correctly and efficiently according to the user's data situation.
The crawler analysis platform based on government-affairs big data provided by the present invention has been described in detail above. Specific examples are used herein to illustrate the principle and implementation of the invention; the explanation of the above embodiments is only intended to help in understanding the method of the invention and its core idea. At the same time, for those skilled in the art, there will be changes in the specific implementation and application scope according to the idea of the present invention. In conclusion, the contents of this specification should not be construed as limiting the invention.
Herein, relational terms such as first and second are used merely to distinguish one entity or operation from another entity or operation, without necessarily requiring or implying any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In the absence of further restrictions, an element limited by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
Claims (7)
1. A crawler analysis platform based on government-affairs big data, characterized by comprising: a data acquisition module, a data cleaning module, a data classification module, a data sampling module, a data statistics module, and a data push module;
wherein the data acquisition module is used for crawling website data;
the data acquisition module is connected to the data cleaning module and hands the crawled website data to the data cleaning module, which cleans the data according to recommended cleaning rules;
the data cleaning module is connected to the data classification module, which performs multi-level classification on the cleaned data;
the data classification module is connected to the data sampling module, which samples the classified data according to an expected data model;
the data sampling module is connected to the data statistics module, which performs statistics on the sampled data to obtain data ranked by user attention;
the data statistics module is connected to the data push module, which pushes its data to users according to their attention, and pushes the users' expected data in the expected data model back to the website data sources, forming a closed data loop.
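The module chain of claim 1 can be illustrated as a simple pipeline. This is a minimal sketch, not the patented implementation; every function name here is a hypothetical stand-in for one of the claimed modules.

```python
# Hypothetical sketch of the module chain in claim 1: acquisition -> cleaning
# -> classification -> sampling -> statistics. Each stage is a plain function;
# the real modules would be connected services as the claim describes.

def acquire():                     # data acquisition module: crawl website data
    return [{"source": "gov", "text": "policy A"},
            {"source": "gov", "text": ""}]        # one incomplete record

def clean(records):                # data cleaning module: drop incomplete rows
    return [r for r in records if r["text"]]

def classify(records):             # data classification module: group by source
    groups = {}
    for r in records:
        groups.setdefault(r["source"], []).append(r)
    return groups

def sample(groups, expected_source):   # data sampling module: keep expected data
    return groups.get(expected_source, [])

def run_pipeline():
    stats = {"count": len(sample(classify(clean(acquire())), "gov"))}
    return stats                   # statistics module output, ready to push

print(run_pipeline())              # -> {'count': 1}
```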
2. The crawler analysis platform based on government-affairs big data according to claim 1, characterized in that the data acquisition module uses the Scrapy framework to perform visual crawling and/or distributed cooperative crawling of website data, the websites including government websites or other partner websites;
wherein the visual crawling process includes:
S1. parsing the website data online, generating a data topology graph, and automatically deriving collection rules from the user's operations;
S2. converting the data structure of the collected data;
S3. receiving, through the visual interface, the user's preliminary filtering of the collection results, together with the crawled website data;
and the distributed cooperative crawling process includes:
S1. counting the website data and distributing it to each crawler;
S2. crawling the website data with cooperating multi-node crawlers;
S3. aggregating the multi-node crawler data.
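The distributed cooperative crawl of claim 2 (S1 distribute, S2 crawl per node, S3 aggregate) can be sketched as follows. The patent names the Scrapy framework but publishes no code, so plain functions stand in for crawler nodes here; the URLs and the round-robin split are illustrative assumptions.

```python
# Hypothetical sketch of the distributed cooperative crawl in claim 2.

def distribute(urls, n_nodes):
    """S1: split the URL list round-robin across n crawler nodes."""
    return [urls[i::n_nodes] for i in range(n_nodes)]

def crawl_node(urls):
    """S2: each node fetches its share (fetching stubbed out)."""
    return [{"url": u, "html": "<html>...</html>"} for u in urls]

def aggregate(node_results):
    """S3: merge the per-node results into one data set."""
    return [page for result in node_results for page in result]

urls = [f"https://example.gov/page{i}" for i in range(5)]
shards = distribute(urls, 2)
pages = aggregate([crawl_node(s) for s in shards])
print(len(pages))   # -> 5
```

In a real deployment each shard would be handed to a separate Scrapy crawler process, with aggregation done through shared storage or a message queue.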
3. The crawler analysis platform based on government-affairs big data according to claim 1, characterized in that the cleaning rules of the data cleaning module include data screening and data replacement;
the data screening screens out incomplete data, erroneous data, and duplicate data;
the data replacement classifies the screened data by screening type, receives the user's processing instructions for the screened-out data, and saves the user's processing operations as a scheme recommended for the next use, the processing instructions including selecting an alternative scheme for handling; the cleaned data is then handed to the data classification module.
4. The crawler analysis platform based on government-affairs big data according to claim 1, characterized in that the classification process of the data classification module includes:
S1. first performing a preliminary classification according to the data sources and the relationships among them;
S2. after classifying by data source, classifying by similarity of data format;
S3. subdividing according to user-defined classification rules;
S4. comparing the data before and after subdivision and adjusting accordingly;
S5. increasing the weight of an operation type each time its subdivision classification rule is executed.
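Steps S1, S3, and S5 of claim 4 can be sketched with user-defined rules as predicates. This is an assumed minimal form: the format-similarity step (S2) and the before/after comparison (S4) are omitted, and the rule names and record fields are illustrative.

```python
# Hypothetical sketch of the classification process in claim 4: classify by
# data source first (S1), subdivide with user-defined rules (S3), and raise
# each rule's weight every time it fires (S5).

from collections import defaultdict

def classify(records, user_rules):
    rule_weights = defaultdict(int)
    groups = defaultdict(list)
    for r in records:
        bucket = r["source"]                  # S1: preliminary class by source
        for name, rule in user_rules.items():
            if rule(r):                       # S3: user-defined subdivision
                bucket = f"{bucket}/{name}"
                rule_weights[name] += 1       # S5: weight per executed rule
        groups[bucket].append(r)
    return dict(groups), dict(rule_weights)

rules = {"notice": lambda r: "notice" in r["title"]}
data = [{"source": "gov", "title": "tax notice"},
        {"source": "gov", "title": "annual report"}]
groups, weights = classify(data, rules)
print(sorted(groups), weights)   # -> ['gov', 'gov/notice'] {'notice': 1}
```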
5. The crawler analysis platform based on government-affairs big data according to claim 1, characterized in that the sampling process of the data sampling module includes:
S1. establishing an expected data model;
S2. feeding the classified data into the corresponding expected data model for training, and rejecting unqualified samples;
S3. choosing whether to feed the rejected data into other expected data models for training;
S4. deciding, from the trained data, whether to add expected-data conditions, and returning to S2 for retraining;
S5. obtaining the final sampled data, recording it in the storage unit, and sending it to the data statistics module.
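The sampling loop of claim 5 can be sketched by representing each "expected data model" as a simple predicate. This is an assumption made for illustration; the patent's models are trained, whereas here acceptance is a fixed test, and the retry against other models (S3) is the fallthrough over the model dictionary.

```python
# Hypothetical sketch of the sampling process in claim 5: samples that fail
# one expected data model (S2) are tried against the others (S3); anything
# no model accepts is rejected as unqualified.

def sample(records, models):
    accepted = {name: [] for name in models}
    rejected = []
    for r in records:
        for name, model in models.items():   # S2/S3: try each expected model
            if model(r):
                accepted[name].append(r)
                break
        else:
            rejected.append(r)               # unqualified sample
    return accepted, rejected

models = {"policy": lambda r: r["kind"] == "policy",
          "news":   lambda r: r["kind"] == "news"}
data = [{"kind": "policy"}, {"kind": "news"}, {"kind": "ad"}]
accepted, rejected = sample(data, models)
print(len(accepted["policy"]), len(accepted["news"]), len(rejected))  # -> 1 1 1
```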
6. The crawler analysis platform based on government-affairs big data according to claim 1, characterized in that the statistics process of the data statistics module includes:
S1. obtaining and counting the users' expected data;
S2. having users subscribe to the data recorded in the storage unit, and generating reports from the resulting data;
S3. obtaining data ranked by user attention, and handing it to the data push module.
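The attention statistics of claim 6 can be sketched as counting topics over user subscriptions. The subscription tuples are an assumed minimal representation of "data recorded in the storage unit".

```python
# Hypothetical sketch of the statistics process in claim 6: count how often
# each topic appears in users' subscriptions to estimate attention, then rank.

from collections import Counter

def rank_by_attention(subscriptions):
    """subscriptions: list of (user, topic) pairs from the storage unit."""
    attention = Counter(topic for _, topic in subscriptions)
    return attention.most_common()           # data ordered by user attention

subs = [("u1", "tax"), ("u2", "tax"), ("u3", "housing")]
print(rank_by_attention(subs))   # -> [('tax', 2), ('housing', 1)]
```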
7. The crawler analysis platform based on government-affairs big data according to claim 1, characterized in that the data push module delivers the data produced by the data statistics module to different users and to the source websites, forming a closed data loop.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910364873.3A CN110175280A (en) | 2019-04-30 | 2019-04-30 | A kind of crawler analysis platform based on government affairs big data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910364873.3A CN110175280A (en) | 2019-04-30 | 2019-04-30 | A kind of crawler analysis platform based on government affairs big data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110175280A true CN110175280A (en) | 2019-08-27 |
Family
ID=67690445
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910364873.3A Pending CN110175280A (en) | 2019-04-30 | 2019-04-30 | A kind of crawler analysis platform based on government affairs big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110175280A (en) |
- 2019-04-30: CN application CN201910364873.3A filed; published as CN110175280A (status: Pending)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106294390A (en) * | 2015-05-20 | 2017-01-04 | 上海纳鑫信息科技有限公司 | A kind of data mining analysis method and system |
CN106855962A (en) * | 2015-12-09 | 2017-06-16 | 星际空间(天津)科技发展有限公司 | A kind of method for building government affairs big data platform |
CN109213869A (en) * | 2017-06-29 | 2019-01-15 | 中国科学技术大学 | Hot spot technology prediction technique based on multi-source data |
CN109636352A (en) * | 2018-12-20 | 2019-04-16 | 湖南晖龙集团股份有限公司 | A kind of distributed content duplicate checking early warning system based on financial big data |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110765337A (en) * | 2019-11-15 | 2020-02-07 | 中科院计算技术研究所大数据研究院 | Service providing method based on internet big data |
CN110765337B (en) * | 2019-11-15 | 2021-04-06 | 中科院计算技术研究所大数据研究院 | Service providing method based on internet big data |
CN111159188B (en) * | 2019-12-28 | 2023-05-09 | 北京慧博科技有限公司 | Processing method for realizing quasi-real-time large data volume based on DataWorks |
CN117076773A (en) * | 2023-08-23 | 2023-11-17 | 上海兰桂骐技术发展股份有限公司 | Data source screening and optimizing method based on internet information |
CN117076773B (en) * | 2023-08-23 | 2024-05-28 | 上海兰桂骐技术发展股份有限公司 | Data source screening and optimizing method based on internet information |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111435344B (en) | Big data-based drilling acceleration influence factor analysis model | |
Xiao et al. | Design rule spaces: A new form of architecture insight | |
Zaiane et al. | Towards evaluating learners' behaviour in a web-based distance learning environment | |
Koru et al. | An investigation of the effect of module size on defect prediction using static measures | |
CN101743542B (en) | Collecting and presenting temporal-based action information | |
CN104077402B (en) | Data processing method and data handling system | |
US20070143842A1 (en) | Method and system for acquisition and centralized storage of event logs from disparate systems | |
Linn et al. | Desktop activity mining-a new level of detail in mining business processes | |
CN104715047A (en) | Social network data collecting and analyzing system | |
CN106095965A (en) | A kind of data processing method and device | |
Hughes et al. | Designing an application for social media needs in emergency public information work | |
CN106209455A (en) | The associated services Fault Locating Method of a kind of cross-system weak coupling and system | |
Martino et al. | Temporal outlier analysis of online civil trial cases based on graph and process mining techniques | |
Ahmed et al. | Centralized log management using elasticsearch, logstash and kibana | |
CN110175280A (en) | A kind of crawler analysis platform based on government affairs big data | |
CN114399205A (en) | Procedural evaluation method, system and equipment suitable for project collaboration | |
CN109710667A (en) | A kind of shared realization method and system of the multisource data fusion based on big data platform | |
Anderson et al. | Architectural Implications of Social Media Analytics in Support of Crisis Informatics Research. | |
CN109446441A (en) | A kind of credible distributed capture storage system of general Web Community | |
CN110941836A (en) | Distributed vertical crawler method and terminal equipment | |
Slaninová et al. | From Moodle log file to the students network | |
Xu et al. | The application of web crawler in city image research | |
Lashari et al. | Monitoring public opinion by measuring the sentiment of retweets on Twitter | |
Das et al. | Popularity analysis on social network: a big data analysis | |
Dong | A Study on the Construction of Human Resources Audit Management Platform Based on Big Data |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190827 |