CN109285046A - E-commerce big data acquisition system based on business plug-ins
- Publication number: CN109285046A
- Application number: CN201810905874.XA
- Authority: CN (China)
- Prior art keywords: task, business, big data, processing module, plug
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0641—Shopping interfaces
Abstract
An e-commerce big data acquisition system based on business plug-ins for big data scenarios. Because e-commerce platforms differ in page content and data types, a separate web crawler service is built for each website, the crawler services are managed as plug-ins, and the business configuration can be upgraded independently. Through task management, callers can use the plug-ins on a plug-and-play basis, and third-party calls are supported. The data acquisition system is deployed on a server side and a client side and is divided into three functional modules: a task management module, a task processing module, and a data processing module. The invention enables users to crawl target e-commerce data quickly and accurately and increases the scale of information collection.
Description
Technical field
The present invention relates to the field of internet big data, and in particular to an e-commerce big data acquisition system based on business plug-ins.
Background art
With the rapid advance of internet technology, e-commerce has also developed very quickly and has become an important force driving regional economic development. Against this background, large volumes of high-quality data are urgently needed for analysis and research, so that the state of regional e-commerce development can be reflected more directly and timely adjustments can be made.
E-commerce big data mainly comprises two kinds of content: basic information about e-commerce business entities and the transaction information of those entities. This data is spread across the major mainstream e-commerce websites. Its content is intricate and its structure varies, and both content and structure keep changing as time passes and technology develops. Under these circumstances, a single one-size-fits-all solution is clearly unrealistic.
Big data acquisition is a key technology in the big data field. It provides essential data support for big data analysis and thereby brings new vitality to traditional data analysis methods. Web crawlers can obtain a large amount of target data in a short time, and after some processing the data becomes well organized.
Traditional big data acquisition tends to focus on a single website, or hard-codes the crawler logic when collecting from multiple websites. It does not take into account the overhead of upgrading the system after an e-commerce website is upgraded, which greatly degrades the user experience.
Summary of the invention
In order to crawl e-commerce data quickly and accurately from the major mainstream e-commerce websites, and thereby provide data support for assessing regional e-commerce development, the invention proposes an e-commerce big data acquisition system based on business plug-ins for big data scenarios.
To solve the above technical problem, the invention provides the following technical solution:
An e-commerce big data acquisition system based on business plug-ins for big data scenarios, wherein the data acquisition system is deployed on a server side and a client side and is divided into three functional modules: a task management module, a task processing module, and a data processing module;
The task management module distributes pending tasks through a task scheduling service. A task claim service receives the pending tasks on the client and hands them to a task parsing service. The task parsing service parses the task content and then, together with a task processing module management service, determines whether a corresponding business processing plug-in exists. If it does not exist, the corresponding business plug-in is downloaded; if it exists, the version numbers are compared and handled accordingly;
The task processing module consists of different concrete business plug-ins and is responsible for the actual crawling tasks;
The data processing module parses the page content returned by the task processing module, organizes it, and stores it in a database.
Further, the task management module is developed on the Delphi IDE integrated development environment; the task processing module is developed in Delphi, using Indy components as the basis for HTTP communication; and the data processing module is developed in Java and uses MySQL for database storage.
Further, in the task management module, different crawler programs are developed for the major mainstream e-commerce websites, according to the differing page content and data types of the e-commerce platforms and an analysis of the requests made for e-commerce website content. These programs are compiled into dynamic link library files and deployed in the module. Through task management, tasks pushed by third parties or stored in the database are packaged in XML format, distributed, and parsed, which completes the dynamic download of the business modules.
Task management is implemented through the cross-process mechanism of a message queue, which avoids the blocking that synchronous processing causes under high concurrency. With a message queue, requests can be processed asynchronously, which relieves pressure on the system.
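The asynchronous handling described above can be sketched as follows. This is only an illustration under stated assumptions: it is written in Python rather than the Delphi and Java named in this document, it uses the standard library's in-process `queue.Queue` as a stand-in for a cross-process message queue, and the function name `run_workers` is hypothetical.

```python
import queue
import threading


def run_workers(tasks, handler, worker_count=2):
    """Hand tasks to worker threads through a queue instead of processing them inline."""
    q = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            task = q.get()
            if task is None:          # sentinel: no more work for this worker
                q.task_done()
                break
            outcome = handler(task)
            with lock:                # results list is shared between workers
                results.append(outcome)
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for task in tasks:                # the producer only enqueues; it never waits on a handler
        q.put(task)
    for _ in threads:
        q.put(None)
    q.join()                          # wait until every queued item has been handled
    for t in threads:
        t.join()
    return results
```

The producer finishes enqueueing without waiting for any single task to complete, which is the property the paragraph attributes to the message queue. Result order is not guaranteed because workers run concurrently.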
In the task processing module, the crawling work is executed through a common exported function of the DLL. Based on commonly used anti-crawling strategies, and through extensive testing, analysis, and summarization, the anti-crawling measures of each e-commerce platform and the corresponding countermeasures have been identified.
In the data processing module, different page parsing programs are designed for the different page content returned by the crawlers, and parsing is carried out by automatically matching the appropriate program.
Technical concept of the invention: many current data acquisition systems are limited to a single website and have difficulty upgrading their crawler logic. Based on the characteristics of each e-commerce platform, and given that the crawling logic is broadly similar across platforms, an e-commerce big data acquisition method based on business plug-ins is proposed. Crawler programs are compiled into dynamic link library files and combined with task management, so that the caller can dynamically download and execute the crawler modules it needs on a plug-and-play basis, which improves business extensibility. In subsequent data processing, a corresponding parsing program is developed for the content returned by each crawler service to extract the required data, achieving a divide-and-conquer effect. The method can collect e-commerce data quickly and accurately from the major mainstream e-commerce websites and provides data support for subsequent data analysis, which is of great significance to the development of the e-commerce industry.
Because e-commerce platforms differ in page content and data types, a separate web crawler service is built for each website, the crawler services are managed as plug-ins, and the business configuration can be upgraded independently. Through task management, callers can use the plug-ins on a plug-and-play basis, and third-party calls are supported. The data acquisition system is deployed on a server side and a client side and is divided into three functional modules: a task management module, a task processing module, and a data processing module. The invention enables users to crawl target e-commerce data quickly and accurately and increases the scale of information collection.
The beneficial effects of the invention are mainly as follows: (1) A crawler service is designed individually for each mainstream e-commerce website, so the crawling is well targeted and the accuracy and scale of the crawled data are improved. (2) The crawler services are deployed on the server side as DLLs and downloaded dynamically when called, which enhances the extensibility of the system; maintaining and upgrading these services on the server side reduces system overhead. (3) Data processing also handles multiple cases: several data parsing procedures are designed, which improves the accuracy of the data stored in the database.
Brief description of the drawings
Fig. 1 is the architecture diagram of the e-commerce big data acquisition system based on business plug-ins for big data scenarios.
Fig. 2 is the flow chart of task generation.
Fig. 3 is the flow chart of task acquisition.
Fig. 4 is the sequence diagram of real-time task claiming.
Fig. 5 is the sequence diagram of task claiming for third-party calls.
Fig. 6 is the crawler flow chart.
Fig. 7 is the data processing flow chart.
Specific embodiment
The invention will be further described below in conjunction with the accompanying drawings.
Referring to Figs. 1 to 7, an e-commerce big data acquisition system based on business plug-ins for big data scenarios is deployed on a server side and a client side and is divided into three functional modules: a task management module, a task processing module, and a data processing module;
The task management module distributes pending tasks through a task scheduling service. A task claim service receives the pending tasks on the client and hands them to a task parsing service. The task parsing service parses the task content and then, together with a task processing module management service, determines whether a corresponding business processing plug-in exists. If it does not exist, the corresponding business plug-in is downloaded; if it exists, the version numbers are compared and handled accordingly;
The task processing module consists of different concrete business plug-ins and is responsible for the actual crawling tasks;
The data processing module parses the page content returned by the task processing module, organizes it, and stores it in a database.
Further, the task management module is developed on the Delphi IDE integrated development environment; the task processing module is developed in Delphi, using Indy components as the basis for HTTP communication; and the data processing module is developed in Java and uses MySQL for database storage.
Further, in the task management module, different crawler programs are developed for the major mainstream e-commerce websites, according to the differing page content and data types of the e-commerce platforms and an analysis of the requests made for e-commerce website content. These programs are compiled into dynamic link library files and deployed in the module. Through task management, tasks pushed by third parties or stored in the database are packaged in XML format, distributed, and parsed, which completes the dynamic download of the business modules.
Task management is implemented through the cross-process mechanism of a message queue, which avoids the blocking that synchronous processing causes under high concurrency. With a message queue, requests can be processed asynchronously, which relieves pressure on the system.
In the task processing module, the crawling work is executed through a common exported function of the DLL. Based on commonly used anti-crawling strategies, and through extensive testing, analysis, and summarization, the anti-crawling measures of each e-commerce platform and the corresponding countermeasures have been identified.
In the data processing module, different page parsing programs are designed for the different page content returned by the crawlers, and parsing is carried out by automatically matching the appropriate program.
Fig. 1 shows the system architecture of the e-commerce big data acquisition system based on business plug-ins for big data scenarios. The figure shows the functional modules: the task management module, the task processing module, and the data processing module. The task management module in turn comprises the task scheduling service, the task claim service, the task parsing service, and the task processing module management service. The task management module starts with task creation. The task scheduling service sends the task to the task claim service on the client; the task claim service receives the task and hands it to the task parsing service; the task parsing service parses the task content and, together with the task processing module management service, matches a suitable business plug-in to the specific task and passes the plug-in and the task to the task processing module. The task processing module is mainly responsible for executing the actual crawling, and the data processing module is mainly responsible for parsing the web page content returned by the requests and storing it in the database.
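The plug-in matching step can be sketched as below, in Python for illustration only. The source states that a missing plug-in is downloaded and that an existing one goes through a version comparison, but it does not say what the comparison leads to; upgrading when the server copy is newer is an assumption here, as are the names `parse_version` and `resolve_plugin` and the dotted version format.

```python
def parse_version(text):
    """Turn '1.10.2' into (1, 10, 2) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def resolve_plugin(name, local_plugins, server_plugins):
    """Decide how to obtain the plug-in a task needs.

    local_plugins and server_plugins map a plug-in name to a version string.
    Returns (action, version now available on the client).
    """
    if name not in server_plugins:
        raise KeyError(f"no plug-in published for {name!r}")
    server_version = server_plugins[name]
    local_version = local_plugins.get(name)
    if local_version is None:
        action = "download"       # plug-in missing on the client
    elif parse_version(local_version) < parse_version(server_version):
        action = "upgrade"        # assumed handling of an outdated local copy
    else:
        action = "use_local"
    if action != "use_local":
        # Stand-in for fetching the DLL from the server side.
        local_plugins[name] = server_version
    return action, local_plugins[name]
```

Comparing versions as integer tuples avoids the string-comparison trap where "1.10.0" sorts before "1.2.0".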
Fig. 2 briefly illustrates the task generation process. The key information in a crawler task is extracted and pushed to MQ (a message queue, which is a cross-process communication mechanism used to pass messages between upstream and downstream). When the MQ consumer receives a message, it determines whether the task has already been created. If the task has not been created, the source of the message is parsed, for example a scheduled task, a task whose first response timed out or was abnormal, or a repeatedly triggered task. A different method is called to create the crawler task depending on the source, and tasks that need to be queued are placed in the task queue.
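A minimal Python sketch of this consumer logic follows. The message fields (`id`, `source`), the three creator functions, and the choice of which task kinds are queued are all hypothetical; the source only says that the creation method depends on the message source and that some tasks are placed in the queue.

```python
from collections import deque


def create_scheduled(msg):
    return {"id": msg["id"], "kind": "scheduled", "queued": True}


def create_retry(msg):
    # For tasks whose first response timed out or was abnormal.
    return {"id": msg["id"], "kind": "retry", "queued": True}


def create_repeat(msg):
    # Assumption for illustration: repeatedly triggered tasks are not queued.
    return {"id": msg["id"], "kind": "repeat", "queued": False}


CREATORS = {
    "scheduled": create_scheduled,
    "retry": create_retry,
    "repeat": create_repeat,
}


def consume(message, created_ids, task_queue):
    """Create a crawler task for a message unless that task already exists."""
    if message["id"] in created_ids:
        return None                        # task already created: nothing to do
    creator = CREATORS[message["source"]]  # pick the creation method by source
    task = creator(message)
    created_ids.add(task["id"])
    if task["queued"]:
        task_queue.append(task)
    return task
```

For example, consuming the same message twice creates the task only once, because the second call finds its id in `created_ids`.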
Fig. 3 illustrates the task acquisition process. Before the scheduled time is reached, the client obtains only its own scheduled crawler tasks, and only after its own tasks have been consumed does it start obtaining tasks pushed by third parties. Once the scheduled time has passed, it consumes third-party pushed tasks first and then the tasks in the queue. After obtaining a task, it checks whether a response timeout is set for the task; if so, an MQ delayed message is generated, which is used to generate a new task.
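The ordering rule can be expressed as a small selection function. This Python sketch rests on several assumptions: "the tasks in the queue" is read as the client's own task queue, time is a plain number, and the `timeout` field and the shape of the delayed message are hypothetical.

```python
from collections import deque


def acquire(now, scheduled_at, own_tasks, third_party_tasks, delayed_messages):
    """Pick the next task according to whether the scheduled time has passed."""
    if now < scheduled_at:
        order = (own_tasks, third_party_tasks)   # own scheduled tasks come first
    else:
        order = (third_party_tasks, own_tasks)   # third-party pushes come first
    task = None
    for source in order:
        if source:
            task = source.popleft()
            break
    if task is not None and task.get("timeout") is not None:
        # Stand-in for publishing an MQ delayed message that later creates a new task.
        delayed_messages.append({"task_id": task["id"], "delay": task["timeout"]})
    return task
```

Both queues are `collections.deque` objects so that taking the oldest task is a constant-time operation.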
Fig. 4 shows the sequence diagram for claiming scheduled tasks and the concrete implementation flow of the business plug-ins. First, different crawler programs are written for different e-commerce platforms, compiled into dynamic link library (DLL) files, and deployed on the server side. When the module is initialized, the crawler's task claim service starts working and polls the task scheduling service. When a pending task is obtained, it is added to the processing queue and handed to the task parsing service. The task parsing service parses the task description file, which generally contains the task name, the steps to be processed, the corresponding page addresses, the content type, and the parameter list requirements. Together with the task processing module management service, it matches the required crawler service (that is, the DLL), downloads it, and hands it to the task processing module, which executes it through the common exported function of the DLL.
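To make the task description file concrete, the sketch below parses one with Python's standard `xml.etree.ElementTree`. The source lists the kinds of fields the file contains but not its schema, so the element and attribute names (`task`, `plugin`, `step`, `param`, `contentType`) and the example URLs are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical task description; the real schema is not given in the source.
SAMPLE = """<task name="shop-info">
  <plugin>taobao</plugin>
  <step order="1" contentType="html" url="https://example.com/shop">
    <param name="shopId">123</param>
  </step>
  <step order="2" contentType="json" url="https://example.com/items">
    <param name="page">1</param>
  </step>
</task>"""


def parse_task(xml_text):
    """Extract the task name, the plug-in to match, and the ordered steps."""
    root = ET.fromstring(xml_text)
    steps = []
    for step in sorted(root.findall("step"), key=lambda s: int(s.get("order"))):
        steps.append({
            "url": step.get("url"),
            "content_type": step.get("contentType"),
            "params": {p.get("name"): p.text for p in step.findall("param")},
        })
    return {
        "name": root.get("name"),
        "plugin": root.findtext("plugin"),
        "steps": steps,
    }
```

The `plugin` value is what a plug-in management service would use to look up the matching DLL, and each step carries the page address, content type, and parameters the crawler needs.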
Fig. 5 shows the sequence diagram for claiming tasks when a third party makes calls through the client. The third party continuously distributes tasks to the task claim service through an interface exposed by the client. When a pending task is obtained, it is added to the processing queue and handed to the task parsing service. The task parsing service parses the task description file, matches the required crawler service (that is, the DLL), downloads it, and hands it to the task processing module, which executes it through the common exported function of the DLL.
Fig. 6 shows the basic crawler flow.
1. As soon as the task processing module receives a crawler task, it builds a queue from the URLs in the task, reads the URLs in turn, and records the time each URL was read to prevent URLs from being crawled repeatedly.
2. The first login requires verification; afterwards the logged-in state is maintained by automatic login. For automatic login, a packet capture tool is used during the first login to capture the login request URL and the required header information, and the login request is then simulated with this information. If login succeeds, the current host obtains the cookie information and session credentials for the website. If the session expires after some time, automatic login is performed again.
3. The required request parameters are assembled.
4. Using the uniform resource locator (URL), a browser is simulated over the Hypertext Transfer Protocol (HTTP) to send a request to the website server and obtain the web page content.
5. The returned status_code is used to verify whether the returned data is correct. If it is correct, this crawl ends. If it is wrong, the URL is first recorded and stored in the database, and the error message is sent by email to technical staff, who analyze the cause and resolve it.
6. The URL queue is checked. If it is empty, the connection is closed and acquisition stops; if it is not empty, steps 1 to 6 are repeated.
During crawling, several countermeasures are designed for the anti-crawling measures of the e-commerce platforms. For the 1688 platform, the crawl frequency of a single IP is kept below the threshold set by the platform. For the Tmall and Taobao platforms, image-based text recognition is used to solve the dynamic verification code that is required after half an hour of continuous data collection from a logged-in site under the same IP.
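Steps 1, 5, and 6, together with the per-IP frequency control, can be sketched in Python as a single loop. This is an illustration under assumptions, not the Delphi implementation: the `fetch` callable is injected so the sketch runs without network access, treating status 200 as the only success code is an assumption, failures are collected in a list rather than stored in a database and emailed, and login handling (step 2) is omitted.

```python
import time
from collections import deque


def crawl(urls, fetch, min_interval=0.0, clock=time.monotonic, sleep=time.sleep):
    """Fetch each URL once, spacing requests at least min_interval seconds apart.

    fetch(url) must return a (status_code, body) pair.
    Returns (pages, failed): bodies by URL, and (url, status) pairs for errors.
    """
    pending = deque(urls)
    read_at = {}          # URL -> time it was read; used to skip repeats
    pages = {}
    failed = []
    last_request = None
    while pending:        # step 6: stop when the queue is empty
        url = pending.popleft()
        if url in read_at:
            continue      # step 1: never crawl the same URL twice
        if last_request is not None:
            wait = min_interval - (clock() - last_request)
            if wait > 0:
                sleep(wait)   # keep the request frequency below the chosen threshold
        last_request = clock()
        read_at[url] = last_request
        status, body = fetch(url)
        if status == 200:     # step 5: verify the response by its status code
            pages[url] = body
        else:
            failed.append((url, status))
    return pages, failed


def fake_fetch(url):
    """Offline stand-in for an HTTP request, used only to demonstrate the loop."""
    return (200, "ok:" + url) if "good" in url else (500, "")
```

Calling `crawl(["good/1", "bad/2", "good/1"], fake_fetch)` visits two distinct URLs, stores the body for the first, and records the second as failed. A real `fetch` would wrap an HTTP client and carry the cookies obtained at login.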
Fig. 7 shows the workflow of the data processing module.
1. The web page content is obtained, and the corresponding page parsing program is matched according to the e-commerce website the page belongs to.
2. The parsing program extracts the required content from the page.
3. A database connection is established, SQL statements are executed, and the data is stored.
During data parsing, different parsing mechanisms are designed for different data transmission formats:
1. HTML pages are parsed using DOM parsing. The open-source Java library jsoup converts the HTML page into a Document object, CSS selectors locate the elements holding the required data, and the HTML DOM is parsed according to these selectors to produce a variable amount of data.
2. JSON data is converted into Java objects using open-source tools such as GSON.jar and JSON.jar, and each value is traversed to complete parsing.
3. XML data is parsed using the open-source Java toolkit Dom4j, in a manner similar to the parsing of JSON data.
4. Data in other custom formats is analyzed for its format and then split and parsed by means such as character splitting and regular expression matching.
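The format-based dispatch can be illustrated with Python's standard library. This sketch substitutes `html.parser`, `json`, `xml.etree.ElementTree`, and `re` for the jsoup, GSON, and Dom4j tools named above, so it shows the dispatch idea rather than the Java implementation. The `price` class, the `key=value;` custom format, and all function names are hypothetical.

```python
import json
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}  # tags with no closing tag


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element that carries a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0        # greater than 0 while inside a matching element
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.css_class in classes:
            self.depth = 1
            self.found.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.found[-1] += data


def parse_html(text):
    extractor = ClassTextExtractor("price")   # hypothetical target class
    extractor.feed(text)
    return [item.strip() for item in extractor.found]


def parse_json(text):
    return json.loads(text)


def parse_xml(text):
    return {child.tag: child.text for child in ET.fromstring(text)}


def parse_custom(text):
    # Hypothetical custom format: key=value pairs separated by semicolons.
    return dict(re.findall(r"(\w+)=([^;]+)", text))


PARSERS = {"html": parse_html, "json": parse_json, "xml": parse_xml, "custom": parse_custom}


def parse(content_type, text):
    """Match the parser to the content type, then run it."""
    try:
        parser = PARSERS[content_type]
    except KeyError:
        raise ValueError(f"no parser registered for {content_type!r}") from None
    return parser(text)
```

Keeping the parsers in a lookup table means a new format can be supported by registering one more function, which mirrors the plug-in approach the system applies to crawlers.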
Claims (6)
1. An e-commerce big data acquisition system based on business plug-ins for big data scenarios, characterized in that: the data acquisition system is deployed on a server side and a client side and is divided into three functional modules: a task management module, a task processing module, and a data processing module;
the task management module distributes pending tasks through a task scheduling service; a task claim service receives the pending tasks on the client and hands them to a task parsing service; the task parsing service parses the task content and then, together with a task processing module management service, determines whether a corresponding business processing plug-in exists; if it does not exist, the corresponding business plug-in is downloaded; if it exists, the version numbers are compared and handled accordingly;
the task processing module consists of different concrete business plug-ins, is responsible for the actual crawling tasks, and executes them through a common exported function of the DLL;
the data processing module parses the page content returned by the task processing module, organizes it, and stores it in a database.
2. The e-commerce big data acquisition system based on business plug-ins for big data scenarios as claimed in claim 1, characterized in that: the task management module is developed on the Delphi IDE integrated development environment; the task processing module is developed in Delphi, using Indy components as the basis for HTTP communication; and the data processing module is developed in Java and uses MySQL for database storage.
3. The e-commerce big data acquisition system based on business plug-ins for big data scenarios as claimed in claim 1 or 2, characterized in that: in the task management module, different crawler programs are written according to the differing page content and data types of the e-commerce platforms and an analysis of the requirements for e-commerce website content, and are compiled into dynamic link library files and deployed in the module; through task management, tasks pushed by third parties or stored in the database are packaged in XML format, distributed, and parsed, which completes the dynamic download of the business modules.
4. The e-commerce big data acquisition system based on business plug-ins for big data scenarios as claimed in claim 3, characterized in that: task management is implemented through the cross-process mechanism of a message queue, which avoids the blocking that synchronous processing causes under high concurrency; with a message queue, requests can be processed asynchronously, which relieves pressure on the system.
5. The e-commerce big data acquisition system based on business plug-ins for big data scenarios as claimed in claim 1 or 2, characterized in that: the task processing module executes the crawling work through a common exported function of the DLL; based on the restrictions imposed by each e-commerce platform, and through testing, analysis, and summarization, the anti-crawling measures of each e-commerce platform and the corresponding countermeasures are identified.
6. The e-commerce big data acquisition system based on business plug-ins for big data scenarios as claimed in claim 1 or 2, characterized in that: the data processing module designs different page parsing programs for the pages of the different websites returned by the crawlers, carries out parsing by automatically matching the appropriate program, and applies multiple parsing mechanisms for the different page content types.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810905874.XA CN109285046A (en) | 2018-08-10 | 2018-08-10 | A kind of electric business big data acquisition system based on business plug-in unit |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109285046A true CN109285046A (en) | 2019-01-29 |
Family
ID=65182754
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810905874.XA Pending CN109285046A (en) | 2018-08-10 | 2018-08-10 | A kind of electric business big data acquisition system based on business plug-in unit |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109285046A (en) |
Legal Events
Code | Title | Description
---|---|---
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190129