CN109285046A - E-commerce big data acquisition system based on business plug-ins

Publication number
CN109285046A
CN109285046A CN201810905874.XA CN201810905874A CN109285046A CN 109285046 A CN109285046 A CN 109285046A CN 201810905874 A CN201810905874 A CN 201810905874A CN 109285046 A CN109285046 A CN 109285046A
Authority
CN
China
Prior art keywords
task
business
big data
processing module
plug
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810905874.XA
Other languages
Chinese (zh)
Inventor
徐志江
李天琦
张昱
卢为党
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0641 Shopping interfaces

Abstract

An e-commerce big data acquisition system based on business plug-ins for big data scenarios. Because e-commerce platforms differ in page content and data types, a separate web crawler business is built for each website, with plug-in management and independent upgrading of business configurations. Through task management, callers can use the system in a plug-and-play manner, and third-party calls are supported. The data acquisition system is deployed on a server and clients and is divided into three functional modules: a task management module, a task processing module, and a data processing module. The invention enables users to capture target e-commerce data quickly and accurately and increases the scale of information collection.

Description

E-commerce big data acquisition system based on business plug-ins
Technical field
The present invention relates to the field of Internet big data, and in particular to an e-commerce big data acquisition system based on business plug-ins.
Background
With the rapid advance of Internet technology, e-commerce has also developed swiftly and has become an important force driving regional economic development. Against this background, large-volume, high-quality data is urgently needed for analysis and research, so that the state of the regional e-commerce industry can be reflected more directly and timely adjustments can be made.
E-commerce big data mainly comprises two parts: the basic information of e-commerce business entities and the transaction information of those entities. This data is spread across the major mainstream e-commerce websites. Its content is intricate and its structure varies, and both content and structure keep changing as time passes and technology develops. Under these conditions, a single one-size-fits-all solution is clearly unrealistic.
Big data acquisition is a key technology in the big data field. It provides important data support for big data analysis and thereby injects new vitality into traditional data analysis methods. Web crawlers can obtain a large amount of target data in a short time, and after some processing the data becomes well organized.
Traditional big data acquisition usually focuses on a single website, or, when multiple websites are covered, hard-codes the crawler business. Neither approach accounts for the overhead of upgrading the system after an e-commerce website is upgraded, which substantially degrades the user experience.
Summary of the invention
To capture e-commerce data quickly and accurately from the major mainstream e-commerce websites, and thereby provide data support for assessing regional e-commerce development, the invention proposes an e-commerce big data acquisition system based on business plug-ins for big data scenarios.
To solve the above technical problem, the invention provides the following technical solution:
An e-commerce big data acquisition system based on business plug-ins for big data scenarios, wherein the data acquisition system is deployed on a server and clients and is divided into three functional modules: a task management module, a task processing module, and a data processing module;
The task management module distributes pending tasks through the task scheduling service; the task claim service receives the pending tasks on the client and hands them to the task parsing service, which parses the task content and then, together with the task-processing-module management service, checks whether a corresponding business plug-in exists; if it does not exist, the corresponding business plug-in is downloaded; if it exists, the version numbers are compared and handled accordingly;
The task processing module consists of different specific business plug-ins and is responsible for the actual crawler tasks;
The data processing module parses the page content returned by the task processing module, organizes it, and stores it in the database.
Further, the task management module is developed with the Delphi IDE as the development platform; the task processing module is developed in Delphi, using the Indy components as the basis for HTTP communication; the data processing module is developed in Java and uses MySQL for database storage.
Further, in the task management module, based on the page content and data types of the differently styled e-commerce platforms and on an analysis of the requests for e-commerce website content, different crawler programs are developed for the major mainstream e-commerce websites, compiled into dynamic link library files, and deployed in the module. Under the task management scheme, tasks pushed by third parties or stored in the database are packaged in XML format, then distributed and parsed, which completes the dynamic download of the business module.
Task management is implemented through the cross-process mechanism of a message queue, which avoids the blocking caused by synchronous processing under high concurrency; with a message queue, requests can be processed asynchronously, which relieves pressure on the system.
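The asynchronous hand-off can be illustrated with a minimal Python sketch. It is not the patented implementation (which is built with Delphi and a cross-process message queue); here an in-process `queue.Queue` stands in for the message queue, and the task names and the `None` stop sentinel are assumptions made for the example.

```python
import queue
import threading

def start_worker(task_queue, results):
    """Consume tasks until a None sentinel arrives (hypothetical protocol)."""
    def run():
        while True:
            task = task_queue.get()
            if task is None:              # sentinel: stop the worker
                task_queue.task_done()
                break
            results.append(f"handled:{task}")  # stand-in for real crawler work
            task_queue.task_done()
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker

task_queue = queue.Queue()
results = []
worker = start_worker(task_queue, results)

# The producer only enqueues and returns at once; it does not wait
# for any task to be processed.
for name in ("shop-info", "trade-info"):
    task_queue.put(name)

task_queue.put(None)   # ask the worker to stop
task_queue.join()      # blocking here is only for the demonstration
worker.join()
```

The producer and the consumer are decoupled by the queue, so a slow task delays only the worker, not the caller that submitted it.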
In the task processing module, the crawler business is executed through the DLL's common exported function. Based on existing common anti-crawling strategies, and through extensive testing, analysis, and summarization, the anti-crawling measures of each e-commerce platform and the corresponding countermeasures have been summarized.
In the data processing module, different page parsing programs are designed for the different page contents returned by the crawlers, and parsing is carried out by automatic matching.
Technical concept of the invention: many current data acquisition systems are limited to a single website and have difficulty upgrading the crawler business. Based on the characteristics of each e-commerce platform, and on the observation that the crawler business is essentially the same across platforms, we propose an e-commerce big data acquisition method based on business plug-ins. By compiling crawler programs into dynamic link library files and managing tasks, callers can dynamically download and execute the crawler business modules they need in a plug-and-play manner, which improves the extensibility of the business. In subsequent data processing, a corresponding parser is developed for the content returned by each crawler business to extract the required data, achieving a divide-and-conquer effect. The method can collect e-commerce data quickly and accurately from the major mainstream e-commerce websites and provide data support for subsequent analysis, which is significant for the development of the e-commerce industry.
Because e-commerce platforms differ in page content and data types, a separate web crawler business is built for each website, with plug-in management and independent upgrading of business configurations. Through task management, callers can use the system in a plug-and-play manner, and third-party calls are supported. The data acquisition system is deployed on a server and clients and is divided into three functional modules: a task management module, a task processing module, and a data processing module. The invention enables users to capture target e-commerce data quickly and accurately and increases the scale of information collection.
The beneficial effects of the invention are mainly: (1) A crawler business is designed individually for each mainstream e-commerce website, which is highly targeted and improves the accuracy and scale of the crawled data. (2) The crawler businesses are deployed on the server as DLLs and are dynamically downloaded and called, which enhances the extensibility of the system; the server maintains and upgrades these businesses, which reduces overhead. (3) Data processing also handles multiple cases: several data parsing procedures are designed, which improves the accuracy of the data stored in the database.
Brief description of the drawings
Fig. 1 is the architecture diagram of the e-commerce big data acquisition system based on business plug-ins for big data scenarios.
Fig. 2 is the flowchart of task generation.
Fig. 3 is the flowchart of task acquisition.
Fig. 4 is the sequence diagram of real-time task claiming.
Fig. 5 is the sequence diagram of third-party task claiming.
Fig. 6 is the crawler flowchart.
Fig. 7 is the data processing flowchart.
Detailed description of the embodiments
The invention is further described below with reference to the accompanying drawings.
Referring to Figs. 1 to 7, an e-commerce big data acquisition system based on business plug-ins for big data scenarios is deployed on a server and clients and is divided into three functional modules: a task management module, a task processing module, and a data processing module;
The task management module distributes pending tasks through the task scheduling service; the task claim service receives the pending tasks on the client and hands them to the task parsing service, which parses the task content and then, together with the task-processing-module management service, checks whether a corresponding business plug-in exists; if it does not exist, the corresponding business plug-in is downloaded; if it exists, the version numbers are compared and handled accordingly;
The task processing module consists of different specific business plug-ins and is responsible for the actual crawler tasks;
The data processing module parses the page content returned by the task processing module, organizes it, and stores it in the database.
Further, the task management module is developed with the Delphi IDE as the development platform; the task processing module is developed in Delphi, using the Indy components as the basis for HTTP communication; the data processing module is developed in Java and uses MySQL for database storage.
Further, in the task management module, based on the page content and data types of the differently styled e-commerce platforms and on an analysis of the requests for e-commerce website content, different crawler programs are developed for the major mainstream e-commerce websites, compiled into dynamic link library files, and deployed in the module. Under the task management scheme, tasks pushed by third parties or stored in the database are packaged in XML format, then distributed and parsed, which completes the dynamic download of the business module.
Task management is implemented through the cross-process mechanism of a message queue, which avoids the blocking caused by synchronous processing under high concurrency; with a message queue, requests can be processed asynchronously, which relieves pressure on the system.
In the task processing module, the crawler business is executed through the DLL's common exported function. Based on existing common anti-crawling strategies, and through extensive testing, analysis, and summarization, the anti-crawling measures of each e-commerce platform and the corresponding countermeasures have been summarized.
In the data processing module, different page parsing programs are designed for the different page contents returned by the crawlers, and parsing is carried out by automatic matching.
Fig. 1 shows the system architecture of the e-commerce big data acquisition system based on business plug-ins for big data scenarios. The figure shows the task management module, the task processing module, and the data processing module; the task management module in turn comprises the task scheduling service, the task claim service, the task parsing service, and the task-processing-module management service. The task management module starts with task creation. The task scheduling service sends tasks to the task claim service on the client; the task claim service receives each task and hands it to the task parsing service. The task parsing service parses the task content and, together with the task-processing-module management service, assigns a suitable business plug-in to the specific task, then passes the plug-in and the task to the task processing module. The task processing module is mainly responsible for executing the actual crawler business; the data processing module is mainly responsible for parsing the web content returned by the requests and storing it in the database.
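The plug-in check performed by the task management module (download when the plug-in is missing, compare version numbers when it is present) can be sketched as follows in Python. This is an illustration under stated assumptions, not the patented Delphi code: the dotted version format, the function names, and the rule that an older local copy is updated are all hypothetical, since the text only says the versions are compared and handled accordingly.

```python
def parse_version(text):
    """Turn a dotted version such as '1.4.2' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))

def resolve_plugin(name, local_plugins, server_plugins):
    """Decide what to do with a business plug-in before a task runs.

    local_plugins and server_plugins map plug-in name -> version string.
    Returns 'download', 'update', or 'use_local'.
    """
    if name not in server_plugins:
        raise KeyError(f"no plug-in published for {name!r}")
    if name not in local_plugins:
        return "download"                 # plug-in missing on the client
    local = parse_version(local_plugins[name])
    remote = parse_version(server_plugins[name])
    # Assumption: an older local copy is replaced by the server copy.
    return "update" if local < remote else "use_local"
```

Comparing versions as integer tuples avoids the string-comparison pitfall in which "1.10.0" sorts before "1.2.0".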
Fig. 2 briefly illustrates task generation. The key information of a crawler task is extracted and pushed to the MQ (a message queue, a cross-process communication mechanism used to pass messages between upstream and downstream). When the MQ consumer receives a message, it checks whether the task has already been created. If the task has not been created, the consumer parses the source of the message, for example a timed task, a task whose first response timed out or was abnormal, or a repeatedly triggered task. It then calls a different method to create the crawler task according to the source, and puts the tasks that need queuing into the task queue.
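The consumer logic of Fig. 2 can be sketched in Python as a duplicate check followed by dispatch on the message source. The message fields (`id`, `source`), the source labels, and the `queued` flag that marks which tasks go into the queue are assumptions made for this example; the text does not specify them.

```python
def create_timed(msg):
    return {"id": msg["id"], "kind": "timed", "queued": True}

def create_retry(msg):
    # For tasks whose first response timed out or was abnormal.
    return {"id": msg["id"], "kind": "retry", "queued": True}

def create_repeat(msg):
    return {"id": msg["id"], "kind": "repeat", "queued": False}

# One creation method per message source (labels are hypothetical).
CREATORS = {
    "timed": create_timed,
    "timeout_or_error": create_retry,
    "repeat": create_repeat,
}

def consume(msg, created, task_queue):
    """Handle one MQ message; return the new task, or None if it already exists."""
    if msg["id"] in created:          # the task was created earlier: ignore
        return None
    task = CREATORS[msg["source"]](msg)
    created[task["id"]] = task
    if task["queued"]:                # only some tasks are put into the queue
        task_queue.append(task)
    return task

created, task_queue = {}, []
first = consume({"id": 1, "source": "timed"}, created, task_queue)
duplicate = consume({"id": 1, "source": "timed"}, created, task_queue)
repeat = consume({"id": 2, "source": "repeat"}, created, task_queue)
```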
Fig. 3 illustrates task acquisition. Before the timing point is reached, the client first obtains only its own timed crawler tasks, and starts obtaining third-party pushed tasks only after its own tasks have been consumed. Once the timing point has passed, it consumes third-party pushed tasks first and then the tasks in the queue. After obtaining a task, the client checks whether a response timeout is set for it; if so, an MQ delayed message is generated for creating a new task.
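The priority rule of Fig. 3 can be written as a small Python function. This is a sketch under assumptions: time is represented as a plain number, "the tasks in the queue" is taken to mean the client's own task queue, and the delayed message simply reuses the task's timeout as its delay; none of these details are stated in the text.

```python
from collections import deque

def next_task(now, timing_point, own_tasks, third_party_tasks):
    """Pick the next task according to the timing-point rule."""
    if now < timing_point:
        order = (own_tasks, third_party_tasks)   # own timed tasks first
    else:
        order = (third_party_tasks, own_tasks)   # third-party tasks first
    for source in order:
        if source:
            return source.popleft()
    return None                                  # nothing to do

def delayed_message_for(task):
    """Build an MQ delayed message if the task sets a response timeout."""
    timeout = task.get("response_timeout")
    if timeout is None:
        return None
    # Assumption: the delay equals the configured timeout, in seconds.
    return {"task_id": task["id"], "delay_seconds": timeout}

own = deque([{"id": "a"}])
third = deque([{"id": "b", "response_timeout": 30}])
before_first = next_task(5, 10, own, third)      # own task wins
before_second = next_task(5, 10, own, third)     # own queue empty now
third.append({"id": "c"})
own.append({"id": "d"})
after_first = next_task(15, 10, own, third)      # third-party task wins
```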
Fig. 4 shows the sequence diagram of timed-task claiming and the concrete implementation of the business plug-ins. First, different crawler programs are written for the different e-commerce platforms, compiled into dynamic link library (DLL) files, and deployed on the server. The crawler's task claim service starts working as soon as the module is initialized and polls the task scheduling service. When a pending task is obtained, it is added to the processing queue and handed to the task parsing service, which parses the task description file. A task description file typically contains the task name, the steps to be processed, the corresponding page addresses, the content types, and the required parameter list. Together with the task-processing-module management service, the required crawler business (that is, the DLL) is matched and downloaded, and the task processing module executes it through the DLL's common exported function.
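A task description file of the kind listed above (task name, steps, page addresses, content types, parameter list) could look like the XML in the following Python sketch. The element and attribute names and the example.com URLs are invented for illustration; the patent states only that tasks are packaged as XML, not what the schema is.

```python
import xml.etree.ElementTree as ET

# Hypothetical task description; the real schema is not given in the text.
TASK_XML = """
<task name="shop-basic-info">
  <step order="1" url="https://example.com/shop/1" contentType="html">
    <param name="page">1</param>
  </step>
  <step order="2" url="https://example.com/shop/1/trades" contentType="json"/>
</task>
"""

def parse_task(xml_text):
    """Parse a task description into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    steps = []
    for step in root.findall("step"):
        steps.append({
            "order": int(step.get("order")),
            "url": step.get("url"),
            "content_type": step.get("contentType"),
            "params": {p.get("name"): (p.text or "")
                       for p in step.findall("param")},
        })
    steps.sort(key=lambda s: s["order"])   # run the steps in declared order
    return {"name": root.get("name"), "steps": steps}

task = parse_task(TASK_XML)
```

The parsed task name is what a plug-in lookup would key on when matching the required crawler business.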
Fig. 5 shows the sequence diagram of third-party task claiming. Through an interface exposed by the client, the third party continuously distributes tasks to the task claim service. When a pending task is obtained, it is added to the processing queue and handed to the task parsing service, which parses the task description file; the required crawler business (that is, the DLL) is matched and downloaded, and the task processing module executes it through the DLL's common exported function.
Fig. 6 shows the basic crawler flow.
1. As soon as the task processing module receives a crawler task, it builds a queue from the URLs in the task, reads the URLs in turn, and records the time at which each URL is read, which prevents a URL from being crawled repeatedly.
2. The first login requires verification; afterwards the login state is maintained by automatic login. For automatic login, a packet capture tool is used during the first login to capture the login request URL and the required header information, and the login request is then simulated with this information. On a successful login, the current host obtains the cookie information and session credentials for the website. If the session expires after some time, automatic login is performed again.
3. The required request parameters are assembled.
4. Using the uniform resource locator (URL), a browser is simulated over HTTP to send a request to the web server and obtain the page content.
5. The returned status_code is used to verify whether the returned data is correct. If it is correct, this crawl ends. If it is wrong, the URL is first recorded in the database and the error information is emailed to technical staff, who analyze and resolve the cause.
6. The URL queue is checked. If it is empty, the connection is closed and acquisition stops; if it is not empty, steps 1 to 6 are repeated.
During crawling, several countermeasures are designed against the anti-crawling measures of the e-commerce platforms. Keeping the crawl frequency of a single IP below the critical frequency set by the platform gets past the anti-crawling of the 1688 platform. Image-based text recognition solves the dynamic verification code that Tmall and Taobao require after a logged-in session has collected data continuously for half an hour from the same IP, and thereby gets past their anti-crawling.
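The queue-driven part of this flow (read-time marking, status check, failure recording, and a frequency limit) can be sketched in Python. The sketch leaves out login, cookies, and e-mail notification, and it takes the HTTP call as an injected `fetch` function so that it runs without network access; `fetch`, `min_interval`, and the example URLs are assumptions for illustration only.

```python
import time
from collections import deque

def crawl(urls, fetch, min_interval=0.0):
    """Crawl each distinct URL once.

    fetch(url) must return (status_code, body).
    Returns (pages, failed, read_times).
    """
    pending = deque(urls)
    read_times = {}            # url -> time it was read; blocks repeats
    pages, failed = {}, []
    last_request = None
    while pending:             # an empty queue ends the crawl
        url = pending.popleft()
        if url in read_times:  # already read: skip it
            continue
        read_times[url] = time.time()
        if last_request is not None:
            # Keep the request rate below a chosen threshold.
            wait = min_interval - (time.monotonic() - last_request)
            if wait > 0:
                time.sleep(wait)
        last_request = time.monotonic()
        status, body = fetch(url)
        if status == 200:
            pages[url] = body
        else:
            failed.append((url, status))   # kept for later analysis
    return pages, failed, read_times

def fake_fetch(url):
    """Stand-in for a real HTTP request."""
    return (200, "ok") if url.endswith("/a") else (503, "")

pages, failed, read_times = crawl(
    ["http://x.test/a", "http://x.test/b", "http://x.test/a"], fake_fetch)
```

In a real deployment `min_interval` would be chosen from the platform's observed threshold, which the text does not quantify.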
Fig. 7 shows the workflow of the data processing module.
1. The page content is obtained, and the corresponding page parser is matched according to the e-commerce website the page belongs to.
2. The parser extracts the required content from the page.
3. A database connection is established, SQL statements are executed, and the data is stored.
During data parsing, different parsing mechanisms are designed for the different data transmission formats:
1. HTML pages are parsed with DOM parsing: the open-source Java library jsoup converts the HTML page into a Document object, CSS selectors locate the elements that hold the required data, and the HTML DOM is parsed according to these selectors, yielding a variable amount of data.
2. JSON data is converted into Java class objects with open-source libraries such as GSON.jar and JSON.jar, and each value is traversed to complete the parsing.
3. XML data is parsed with the open-source Java toolkit Dom4j; the process is similar to that for JSON.
4. Data in other custom formats is analyzed for its format and then split and parsed by character splitting, regular expression matching, and similar means.
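The automatic matching of a parser to the returned content can be sketched as a lookup table in Python. This is a simplified stand-in for the Java parsers named above: it uses only the Python standard library, keys the table by data format rather than by website, omits the HTML case (which would need a DOM library comparable to jsoup), and assumes a hypothetical `key=value;key=value` layout for the custom format.

```python
import json
import re
import xml.etree.ElementTree as ET

def parse_json(text):
    return json.loads(text)

def parse_xml(text):
    # Flatten the direct children of the root into tag -> text.
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}

def parse_custom(text):
    # Hypothetical custom format: 'key=value' pairs separated by ';'.
    return dict(re.findall(r"(\w+)=([^;]*)", text))

# Format name -> parser; matching happens by simple lookup.
PARSERS = {"json": parse_json, "xml": parse_xml, "custom": parse_custom}

def parse_page(content_format, text):
    """Select the parser registered for the format and apply it."""
    try:
        parser = PARSERS[content_format]
    except KeyError:
        raise ValueError(f"no parser registered for {content_format!r}")
    return parser(text)

from_json = parse_page("json", '{"shop": "A", "sales": 3}')
from_xml = parse_page("xml", "<shop><name>A</name><sales>3</sales></shop>")
from_custom = parse_page("custom", "name=A;sales=3")
```

Adding support for a new format or website then means registering one more entry, without touching the existing parsers.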

Claims (6)

1. An e-commerce big data acquisition system based on business plug-ins for big data scenarios, characterized in that: the data acquisition system is deployed on a server and clients and is divided into three functional modules: a task management module, a task processing module, and a data processing module;
The task management module distributes pending tasks through the task scheduling service; the task claim service receives the pending tasks on the client and hands them to the task parsing service, which parses the task content and then, together with the task-processing-module management service, checks whether a corresponding business plug-in exists; if it does not exist, the corresponding business plug-in is downloaded; if it exists, the version numbers are compared and handled accordingly;
The task processing module consists of different specific business plug-ins and is responsible for the actual crawler tasks, which are executed through the DLL's common exported function;
The data processing module parses the page content returned by the task processing module, organizes it, and stores it in the database.
2. The e-commerce big data acquisition system based on business plug-ins for big data scenarios according to claim 1, characterized in that: the task management module is developed with the Delphi IDE as the development platform; the task processing module is developed in Delphi, using the Indy components as the basis for HTTP communication; the data processing module is developed in Java and uses MySQL for database storage.
3. The e-commerce big data acquisition system based on business plug-ins for big data scenarios according to claim 1 or 2, characterized in that: in the task management module, based on the page content and data types of the differently styled e-commerce platforms and on an analysis of the requirements for e-commerce website content, different crawler programs are written, compiled into dynamic link library files, and deployed in the module; under the task management scheme, tasks pushed by third parties or stored in the database are packaged in XML format, then distributed and parsed, which completes the dynamic download of the business module.
4. The e-commerce big data acquisition system based on business plug-ins for big data scenarios according to claim 3, characterized in that: task management is implemented through the cross-process mechanism of a message queue, which avoids the blocking caused by synchronous processing under high concurrency; with a message queue, requests can be processed asynchronously, which relieves pressure on the system.
5. The e-commerce big data acquisition system based on business plug-ins for big data scenarios according to claim 1 or 2, characterized in that: the task processing module executes the crawler business through the DLL's common exported function; based on the restrictions of each e-commerce platform, and through testing, analysis, and summarization, the anti-crawling measures of each e-commerce platform and the corresponding solutions are summarized.
6. The e-commerce big data acquisition system based on business plug-ins for big data scenarios according to claim 1 or 2, characterized in that: the data processing module designs different page parsing programs for the pages of different websites returned by the crawlers, carries out parsing by automatic matching, and adopts several parsing mechanisms for the different page content types.
Application CN201810905874.XA, priority date 2018-08-10, filing date 2018-08-10, title: E-commerce big data acquisition system based on business plug-ins, status: Pending, published as CN109285046A (en)

Publications (1)

Publication Number Publication Date
CN109285046A (en) 2019-01-29

Family

ID=65182754

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110347441A (en) * 2019-06-25 2019-10-18 银江股份有限公司 A kind of thermal expansion data integration automotive engine system and method
CN110780983A (en) * 2019-09-10 2020-02-11 中国平安财产保险股份有限公司 Task exception handling method and device, computer equipment and storage medium
CN112800361A (en) * 2021-01-29 2021-05-14 麒麟合盛网络技术股份有限公司 Content acquisition method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7096220B1 (en) * 2000-05-24 2006-08-22 Reachforce, Inc. Web-based customer prospects harvester system
CN103268319A (en) * 2013-04-17 2013-08-28 北京航空航天大学 Cloud browser based on webpages
CN104765592A (en) * 2014-01-03 2015-07-08 任子行网络技术股份有限公司 Plugin management method and device facing web page acquisition task
CN105243159A (en) * 2015-10-28 2016-01-13 福建亿榕信息技术有限公司 Visual script editor-based distributed web crawler system
CN105956175A (en) * 2016-05-24 2016-09-21 考拉征信服务有限公司 Webpage content crawling method and device
CN107317724A (en) * 2017-06-06 2017-11-03 中证信用增进股份有限公司 Data collecting system and method based on cloud computing technology
CN107895009A (en) * 2017-11-10 2018-04-10 北京国信宏数科技有限责任公司 One kind is based on distributed internet data acquisition method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20190129