CN105825399A - Internet based B2B e-commerce information collecting method - Google Patents

Internet based B2B e-commerce information collecting method

Publication number
CN105825399A
Authority
CN
China
Prior art keywords
information
url
collection
page
platform
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610148102.7A
Other languages
Chinese (zh)
Inventor
郑扬
陈静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Focus Technology Co Ltd
Original Assignee
Focus Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Focus Technology Co Ltd
Priority to CN201610148102.7A
Publication of CN105825399A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Game Theory and Decision Science (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A B2B e-commerce information collecting method comprises the following steps: 1) determining the URLs and content to be collected, where the collected information includes key information such as customer information, product information and industry information; 2) building an anti-blocking component; and 3) compiling the collection results into a report and sending it to a visualization platform. The anti-blocking component combines three mechanisms: 1) maintaining a proxy IP pool, in which proxy IPs are obtained by purchase or by collecting free proxy IPs, are updated and tested every day, and are screened so that only proxy IPs with high connection speed are kept; 2) a redial-on-disconnect component, in which, when an IP is blocked, the IP address is replaced by restarting the router, and when a proxy IP connection times out, redialing is carried out automatically and a new IP is used for crawling; and 3) stress testing, in which a stress test is carried out before each formal crawl to measure the website's response at different crawling frequencies, so as to balance crawling frequency against collectability.

Description

A method for collecting B2B e-commerce information over the Internet
Technical field
The present invention relates to a method for collecting B2B e-commerce information over the Internet, and in particular to a method for collecting competitor information.
Background art
The present invention belongs to the field of network information collection, and in particular concerns a method for collecting competitor information over the Internet. B2B e-commerce refers to e-commerce activity that, based on B2B e-commerce platforms and big data, completes marketing relationships between enterprises by integrating upstream and downstream resources of the industrial chain; it includes information and transaction services, commercial search engine services, financial services, cloud services and so on. In 2014 the transaction volume of China's B2B e-commerce market was 10 trillion yuan, and there were dozens of major platforms, such as Alibaba, HC360 and Made-in-China.com. For practitioners, it is very important to collect competitor information by legal means and use it to make business decisions, following the maxim that one who knows both the enemy and oneself can fight a hundred battles without danger of defeat.
URL encoding is the format a browser uses to package form input. The browser takes all the names and values from the form, encodes them as name/value parameters (removing characters that cannot be transmitted, ordering the data, and so on), and sends them to the server either as part of the URL or separately. To obtain competitor information in a timely, legal and effective way, a method is proposed for collecting competitor information by URL over the Internet and feeding it back.
Summary of the invention
The object of the present invention is to propose a method for collecting B2B e-commerce competitor information over the Internet, which is also a competitive-intelligence analysis method for B2B e-commerce. By running scheduled collection tasks together with a series of related configurations, it collects the daily platform data of B2B e-commerce competitors, and by means of a data extraction, filtering and conversion system it compares the competitors' information with one's own data. Understanding competitors in this way can contribute considerably to one's own growth.
The technical solution of the present invention is a method for collecting B2B e-commerce (competitor) information, comprising:
Step 1: determine the URLs and content to be collected, including key information such as customer information, product information and industry information.
(1) Configure the rules for the URLs to be collected: the HTML page identified by a collected URL contains the information to be collected, and the URL can be expressed with wildcards or regular expressions, which facilitates large-scale collection while avoiding collecting the same URL more than once.
The URLs of a B2B e-commerce platform mainly include the home page, product pages, member detail pages, industry catalogue pages and so on. They follow regular patterns, so wildcards and regular expressions can express the URLs of the different page types.
The URL items to be configured for collection include the platform domain name, the product page rule, the member page rule, the product detail page, the member detail page and so on.
(2) Member information collection:
A B2B e-commerce platform serves a large number of members. The main items collected are the competitor's total number of members, number of paying members, number of free members, the distribution of members by province, city and industry, and the daily number of members of each type.
(3) Industry information collection:
The products on a B2B e-commerce platform span many industries. The main items collected are the competitor's industry catalogue on its B2B e-commerce platform and the changes to that catalogue.
(4) Deploy the URL collection tasks and run the collection program on a schedule.
Because competitor information on B2B e-commerce platforms is updated every day, the corresponding collection program must also run every day.
In addition, a B2B e-commerce platform has a massive number of pages, and the more timely the collection the better; however, once the request frequency from a single IP exceeds a certain threshold, the collection failure rate rises sharply. A distributed collection scheme is therefore used: collection tasks are distributed to a computer cluster, each computer in the cluster obtains a different IP address, and collection state is shared in real time within the cluster. This achieves timely information collection and improves the success rate.
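The distribution of collection tasks across a cluster can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names `assign_worker` and `partition` are hypothetical, a stable hash is only one possible way to plan which machine handles which URL, and an in-memory set stands in for the collection state that a real cluster would share between machines.

```python
import hashlib


def assign_worker(url: str, num_workers: int) -> int:
    """Map a URL to exactly one worker index using a stable hash."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_workers


def partition(urls, num_workers):
    """Split a URL list into per-worker task lists, skipping duplicates."""
    tasks = {worker: [] for worker in range(num_workers)}
    seen = set()  # assumption: stands in for the cluster's shared state
    for url in urls:
        if url in seen:
            continue  # the same URL is never scheduled twice
        seen.add(url)
        tasks[assign_worker(url, num_workers)].append(url)
    return tasks
```

Because the hash depends only on the URL, every machine that evaluates `assign_worker` reaches the same answer, so a URL is collected by one machine only.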
Step 2: build the anti-blocking component, using the following modes together:
Mode 1: maintain a proxy IP pool.
Proxy IPs are obtained by purchase or by collecting free proxy IPs. They are updated and tested every day, and those with high connection speed are selected and added to the proxy IP pool. A proxy IP that becomes too slow or whose connection times out is automatically removed from the pool.
Mode 2: redial-on-disconnect component.
After an IP is blocked, the IP address can be replaced by restarting the router; when a proxy IP connection times out, the component redials automatically and uses the new IP for crawling.
Mode 3: stress testing.
A stress test should be run before each formal crawl to measure how the website responds at different crawling frequencies, so as to balance crawling frequency against collectability.
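The proxy IP pool of Mode 1 could look roughly like the following Python sketch. The class name `ProxyPool`, the `max_latency` threshold and the injected `probe` callable are assumptions made for illustration; the patent does not specify how connection speed is measured, so `probe` is simply expected to return a latency in seconds, or `None` when the connection times out.

```python
from typing import Callable, Iterable, Optional


class ProxyPool:
    """Keeps only proxies whose measured latency is below a threshold."""

    def __init__(self, max_latency: float = 1.0):
        self.max_latency = max_latency  # assumed threshold, in seconds
        self._latency = {}  # proxy address -> last measured latency

    def refresh(self, candidates: Iterable[str],
                probe: Callable[[str], Optional[float]]) -> None:
        """Daily update: test every candidate, keep fast ones, drop the rest."""
        for proxy in candidates:
            latency = probe(proxy)  # None means the connection timed out
            if latency is not None and latency <= self.max_latency:
                self._latency[proxy] = latency
            else:
                self._latency.pop(proxy, None)

    def evict(self, proxy: str) -> None:
        """Remove a proxy that timed out or slowed down during a crawl."""
        self._latency.pop(proxy, None)

    def fastest(self) -> Optional[str]:
        """Return the proxy with the lowest recorded latency, if any."""
        if not self._latency:
            return None
        return min(self._latency, key=self._latency.get)
```

Passing the probe in as a parameter keeps the pool logic independent of any particular network library, so the same class can be exercised with a stub during testing.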
Step 3: compile the collection results into a report and send it to the visualization platform.
Once URL collection for the B2B e-commerce platform has finished, the collection results are published to the visualization platform, which can perform multidimensional analysis on the collected information or data and produce analysis reports. A corresponding logging system is also provided so that the operator can review the collection records.
Further, a distributed parallel computing environment is built, and a cluster of several machines is used to collect from the target website. Attention must be paid to communication between the machines: URL planning prevents the same URL from being collected repeatedly by several machines, and a collection log is kept so that collections that fail for reasons such as network congestion or a blocked IP can be handled.
Company information collection: first extract the rule of the company detail page URLs, generate a general URL list from the rule, traverse the URL list, and collect the company's information, such as name, province, city and membership level.
Product information collection: first extract the rule of the product detail page URLs, generate a general URL list from the rule, traverse the URL list, and collect the product's information, such as name, price, specification, attributes and industry catalogue.
Industry catalogue information collection: the number of industry catalogues is small and there is usually a navigation page, so the industry catalogue is collected from the navigation page; it is collected every day and compared with the previous collection to monitor changes in the competitor's catalogue.
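The daily comparison of the industry catalogue with the previous collection amounts to a set difference. The following Python sketch shows one way to express it; the function name `diff_catalogue` and the representation of a catalogue as a list of category names are assumptions, not details from the patent.

```python
def diff_catalogue(previous, current):
    """Compare two catalogue snapshots and report what changed."""
    prev, curr = set(previous), set(current)
    return {
        "added": sorted(curr - prev),    # categories new since last time
        "removed": sorted(prev - curr),  # categories that disappeared
    }
```

For example, comparing yesterday's snapshot `["Machinery", "Textiles"]` with today's `["Machinery", "Chemicals"]` reports "Chemicals" as added and "Textiles" as removed.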
Beneficial effects of the present invention:
1. The key information of B2B e-commerce competitors can be collected effectively every day.
2. The information of B2B e-commerce competitors can be compared with one's own in a timely manner.
3. Comprehensive system monitoring improves the availability of the system and the efficiency of system maintenance.
4. Multiple collection requirements can be handled by one unified platform, which avoids duplicated development and reduces cost.
Description of the drawings
Fig. 1 is a flow chart of the B2B e-commerce competitor information collecting method of the present embodiment.
Detailed description of the embodiments
As shown in Fig. 1, the processing flow of the B2B e-commerce competitor information collecting method of the present embodiment comprises:
Step 1: determine the URLs to be collected and the key information to be collected, such as customer information, product information and industry information.
The basic principle for choosing a URL to collect is that the HTML page identified by the URL contains the information to be collected and that the URL can be expressed with wildcards or regular expressions, which facilitates large-scale collection while avoiding collecting the same URL more than once.
When determining the key information to be collected, such as customer information, product information and industry information, the collection targets should be chosen reasonably according to business needs.
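The idea of expressing the URLs of each page type with regular expressions, while skipping URLs that have already been seen, can be sketched in Python as follows. Everything concrete here is hypothetical: the domain `b2b.example.com`, the path layouts and the names `URL_RULES`, `classify` and `select_urls` are invented for illustration, since the patent does not give the actual URL rules of any platform.

```python
import re

# Hypothetical URL rules, one regular expression per page type.
URL_RULES = {
    "product_detail": re.compile(r"^https://b2b\.example\.com/product/(\d+)\.html$"),
    "member_detail": re.compile(r"^https://b2b\.example\.com/company/([\w-]+)/?$"),
    "industry_catalogue": re.compile(r"^https://b2b\.example\.com/catalog/([\w-]+)/?$"),
}


def classify(url):
    """Return the page type whose rule matches the URL, or None."""
    for page_type, pattern in URL_RULES.items():
        if pattern.match(url):
            return page_type
    return None


def select_urls(candidates):
    """Keep URLs that match a rule, each at most once, in input order."""
    seen = set()
    selected = []
    for url in candidates:
        page_type = classify(url)
        if page_type is None or url in seen:
            continue  # not a page of interest, or already scheduled
        seen.add(url)
        selected.append((page_type, url))
    return selected
```

Keeping the rules in a single table means that supporting another platform or page type only requires adding an entry, not changing the traversal logic.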
Step 2: build the anti-blocking component.
Mode 1: maintain a proxy IP pool.
Proxy IPs are obtained by purchase or by collecting free proxy IPs. They are updated and tested every day, and those with high connection speed are selected and added to the proxy IP pool. A proxy IP that becomes too slow or whose connection times out is automatically removed from the pool.
Mode 2: redial-on-disconnect component.
This takes advantage of the fact that the IP of an ADSL line is not fixed: after an IP is blocked, the IP address can be replaced by restarting the router, and when a proxy IP connection times out, the component redials automatically and uses the new IP for crawling.
Mode 3: stress testing.
The best anti-blocking measure is still to reduce the collection frequency. A stress test should therefore be run before each formal crawl to measure how the website responds at different crawling frequencies, so as to balance crawling frequency against collectability.
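One way to read the stress test of Mode 3 is as a search for the highest crawling frequency at which the site still answers reliably. The Python sketch below follows that reading under stated assumptions: `fetch` is a hypothetical callable that issues one test request at the given rate and returns True on success, rates are expressed in requests per second, and the thresholds `min_success` and `trials` are illustrative values, none of which appear in the patent.

```python
def success_rate(fetch, rate, trials):
    """Fraction of test requests that succeed at a given request rate."""
    ok = sum(1 for _ in range(trials) if fetch(rate))
    return ok / trials


def choose_rate(fetch, candidate_rates, min_success=0.95, trials=20):
    """Pick the highest tested rate whose success rate is acceptable.

    Rates are tried from slowest to fastest and testing stops at the
    first rate that fails, on the assumption that failures only become
    more frequent as the request rate increases.
    """
    best = None
    for rate in sorted(candidate_rates):
        if success_rate(fetch, rate, trials) >= min_success:
            best = rate
        else:
            break
    return best
```

Stopping at the first failing rate keeps the test itself from hammering the site at frequencies that are already known to be too high.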
Step 3: build a distributed parallel computing environment and use a cluster of several machines to collect from the target website.
Attention must be paid to communication between the machines: URL planning prevents the same URL from being collected repeatedly by several machines, and a collection log is kept so that collections that fail for reasons such as network congestion or a blocked IP can be handled.
The specific method is as follows:
Company information: first extract the rule of the company detail page URLs, generate a general URL list from the rule, traverse the URL list, and collect the company's information, such as name, province, city and membership level; the time at which a company is first collected is taken as the date on which that company joined the competitor's platform.
Product information: first extract the rule of the product detail page URLs, generate a general URL list from the rule, traverse the URL list, and collect the product's information, such as name, price, specification, attributes and industry catalogue.
Industry catalogue information: the number of industry catalogues is small and there is usually a navigation page, so the industry catalogue can be collected from the navigation page; collecting it every day and comparing it with the previous collection makes it possible to monitor changes in the competitor's catalogue.
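Two details of this step lend themselves to a short Python sketch: keeping a collection log so that failed collections can be retried, and treating the first collection date of a company as its joining date. The names `collect_all` and `record_first_seen`, the `fetch` callable, the retry limit and the use of `OSError` to signal a network failure are all assumptions made for illustration.

```python
def collect_all(urls, fetch, log, max_attempts=3):
    """Fetch each URL, logging every attempt and retrying on failure."""
    results = {}
    for url in urls:
        for attempt in range(1, max_attempts + 1):
            try:
                results[url] = fetch(url)  # hypothetical page downloader
                log.append((url, attempt, "ok"))
                break
            except OSError as exc:  # e.g. congestion or a blocked IP
                log.append((url, attempt, f"failed: {exc}"))
    return results


def record_first_seen(first_seen, company_ids, today):
    """Store the first collection date of each company; never overwrite."""
    for company_id in company_ids:
        first_seen.setdefault(company_id, today)
    return first_seen
```

Because `setdefault` only writes when the key is absent, rerunning the collection on later days leaves the original first-seen date untouched.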
Step 4: perform multidimensional analysis on the collected information and produce analysis reports.
For example, to analyze a competitor's membership, the information collected includes:
1) the member company's name, province and city, and whether it is a paying member;
2) the time at which the member was first collected;
3) the industry to which the member belongs;
4) the member's product list.
From this information, the competitor's member count and its split between paying and free members can be analyzed along the dimensions of time, location and industry, and compared with one's own.
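Counting members along one or more dimensions can be sketched in Python with a `Counter`. The record layout, the field names and the sample rows in `SAMPLE_MEMBERS` are made up for illustration and do not come from the patent, and `count_by` is a hypothetical helper.

```python
from collections import Counter

# Made-up sample records; field names are assumptions for this sketch.
SAMPLE_MEMBERS = [
    {"name": "A", "province": "Jiangsu", "industry": "Machinery", "paying": True},
    {"name": "B", "province": "Jiangsu", "industry": "Textiles", "paying": False},
    {"name": "C", "province": "Zhejiang", "industry": "Machinery", "paying": True},
]


def count_by(members, *dims):
    """Count members grouped by the given dimensions (field names)."""
    return Counter(tuple(member[d] for d in dims) for member in members)
```

Calling `count_by(SAMPLE_MEMBERS, "province")` groups by location alone, while `count_by(SAMPLE_MEMBERS, "industry", "paying")` splits each industry into paying and free members; running the same function on one's own member data gives the figures to compare against.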
Step 5: generate the report and send it to senior management.
The collection and analysis of B2B e-commerce competitor information may affect the company's strategic decisions, so an automatic report generation system is provided that turns the analysis into documents and pushes them to senior management periodically.

Claims (2)

1. A method for collecting B2B e-commerce information, characterized by comprising:
Step 1: determine the URLs and content to be collected, including key information such as customer information, product information and industry information;
(1) configure the rules for the URLs to be collected: the HTML page identified by a collected URL contains the information to be collected, and the URL can be expressed with wildcards or regular expressions, which facilitates large-scale collection while avoiding collecting the same URL more than once;
the URLs of a B2B e-commerce platform mainly include the home page, product pages, member detail pages, industry catalogue pages and so on; they follow regular patterns, and wildcards and regular expressions are used to express the URLs of the different page types;
the URL items to be configured for collection include the platform domain name, the product page rule, the member page rule, the product detail page, the member detail page and so on;
(2) member information collection:
a B2B e-commerce platform serves a large number of members; the main items collected are the competitor's total number of members, number of paying members, number of free members, the distribution of members by province, city and industry, and the daily number of members of each type;
(3) industry information collection:
the products on a B2B e-commerce platform span many industries; the main items collected are the competitor's industry catalogue on its B2B e-commerce platform and the changes to that catalogue;
(4) deploy the URL collection tasks and run the collection program on a schedule;
because competitor information on B2B e-commerce platforms is updated every day, the corresponding collection program must also run every day;
a distributed collection scheme is used: collection tasks are distributed to a computer cluster, each computer in the cluster obtains a different IP address, and collection state is shared in real time within the cluster, which achieves timely information collection and improves the success rate;
Step 2: build the anti-blocking component, using the following modes together:
Mode 1: maintain a proxy IP pool;
proxy IPs are obtained by purchase or by collecting free proxy IPs; they are updated and tested every day, and those with high connection speed are selected and added to the proxy IP pool; a proxy IP that becomes too slow or whose connection times out is automatically removed from the pool;
Mode 2: redial-on-disconnect component;
after an IP is blocked, the IP address can be replaced by restarting the router; when a proxy IP connection times out, the component redials automatically and uses the new IP for crawling;
Mode 3: stress testing;
a stress test should be run before each formal crawl to measure how the website responds at different crawling frequencies, so as to balance crawling frequency against collectability;
Step 3: compile the collection results into a report and send it to the visualization platform;
once URL collection for the B2B e-commerce platform has finished, the collection results are published to the visualization platform, which can perform multidimensional analysis on the collected information or data and produce analysis reports; a corresponding logging system is also provided so that the operator can review the collection records.
The analysis report consists of three parts. The first part is an overview comparison between oneself and the competitor, including bar charts and detail tables comparing the cumulative, newly added and lost member counts, aggregated by time.
The second part is an industry comparison between oneself and the competitor, including, for each industry, the member count, the cumulative share of members, the number of new members, the share of new members, the number of lost members and the share of lost members.
The third part is a province and city comparison between oneself and the competitor, including, for each province and city, the member count, the cumulative share of members, the number of new members, the share of new members, the number of lost members, the share of lost members and the ratio of member counts.
The data are presented as pie charts, bar charts and detail tables, which show trends and proportions at a glance while still allowing the underlying figures to be inspected.
2. The information collecting method according to claim 1, characterized in that a distributed parallel computing environment is built, and a cluster of several machines is used to collect from the target website; attention must be paid to communication between the machines: URL planning prevents the same URL from being collected repeatedly by several machines, and a collection log is kept so that collections that fail for reasons such as network congestion or a blocked IP can be handled;
company information collection: first extract the rule of the company detail page URLs, generate a general URL list from the rule, traverse the URL list, and collect the company's information, such as name, province, city and membership level;
product information collection: first extract the rule of the product detail page URLs, generate a general URL list from the rule, traverse the URL list, and collect the product's information, such as name, price, specification, attributes and industry catalogue;
industry catalogue information collection: the number of industry catalogues is small and there is usually a navigation page, so the industry catalogue is collected from the navigation page; it is collected every day and compared with the previous collection to monitor changes in the competitor's catalogue.
CN201610148102.7A 2016-03-15 2016-03-15 Internet based B2B e-commerce information collecting method Pending CN105825399A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610148102.7A CN105825399A (en) 2016-03-15 2016-03-15 Internet based B2B e-commerce information collecting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610148102.7A CN105825399A (en) 2016-03-15 2016-03-15 Internet based B2B e-commerce information collecting method

Publications (1)

Publication Number Publication Date
CN105825399A true CN105825399A (en) 2016-08-03

Family

ID=56987788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610148102.7A Pending CN105825399A (en) 2016-03-15 2016-03-15 Internet based B2B e-commerce information collecting method

Country Status (1)

Country Link
CN (1) CN105825399A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106790674A (en) * 2017-01-25 2017-05-31 杭州淘淘搜科技有限公司 A kind of Agent cluster service system and method based on ADSL
CN110798470A (en) * 2019-10-31 2020-02-14 北京金堤科技有限公司 Proxy IP address management method and system
CN111179027A (en) * 2019-12-16 2020-05-19 付毅 Processing method and device of merchant information, computer equipment and storage medium
CN112822306A (en) * 2021-02-03 2021-05-18 深圳前海微众银行股份有限公司 Proxy IP dynamic management method, device and computer readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101441629A (en) * 2007-11-19 2009-05-27 上海新纳广告传媒有限公司 Automatic acquiring method of non-structured web page information
CN104778273A (en) * 2015-04-24 2015-07-15 淘金信息科技江苏有限公司 Big data analysis system for shopping website
CN105205686A (en) * 2015-08-18 2015-12-30 浪潮软件集团有限公司 Method and system for obtaining product price information
US20160027043A1 (en) * 2007-05-22 2016-01-28 Kount Inc. Collecting Information Regarding Consumer Click-Through Traffic

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106790674A (en) * 2017-01-25 2017-05-31 杭州淘淘搜科技有限公司 A kind of Agent cluster service system and method based on ADSL
CN110798470A (en) * 2019-10-31 2020-02-14 北京金堤科技有限公司 Proxy IP address management method and system
CN111179027A (en) * 2019-12-16 2020-05-19 付毅 Processing method and device of merchant information, computer equipment and storage medium
CN112822306A (en) * 2021-02-03 2021-05-18 深圳前海微众银行股份有限公司 Proxy IP dynamic management method, device and computer readable storage medium
CN112822306B (en) * 2021-02-03 2022-07-29 深圳前海微众银行股份有限公司 Proxy IP dynamic management method, device and computer readable storage medium

Similar Documents

Publication Publication Date Title
US8656006B2 (en) Integrating traffic monitoring data and application runtime data
US7805510B2 (en) Hierarchy for characterizing interactions with an application
CN101192227B (en) Log file analytical method and system based on distributed type computing network
US9009680B2 (en) Selecting instrumentation points for an application
CN102780726B (en) A kind of log analysis method based on WEB platform and system
US11347625B1 (en) Rendering a service graph illustrate page provider dependencies at query time using exemplars
CN105825399A (en) Internet based B2B e-commerce information collecting method
CN102724059A (en) Website operation state monitoring and abnormal detection based on MapReduce
CN103455636A (en) Automatic capturing and intelligent analyzing method based on Internet tax data
US8438276B1 (en) Method of monitoring network and application performance by analyzing web clients and web servers
CN1677403A (en) System and method for automated optimization of search result relevance
CN105224691B (en) A kind of information processing method and device
US11838372B2 (en) URL normalization for rendering a service graph
Reddy et al. Preprocessing the web server logs: an illustrative approach for effective usage mining
CN108574669A (en) User behavior tree constructing method and device
CN108108445A (en) A kind of data intelligence processing method and system
CN114648393A (en) Data mining method, system and equipment applied to bidding
Suchacka Analysis of aggregated bot and human traffic on e-commerce site
WO2022086610A1 (en) End-to-end visibility of a user session
CN106156232A (en) A kind of monitoring method and apparatus of spreading network information
US11693850B1 (en) Rendering a service graph to illustrate page provider dependencies at an aggregate level
US11392605B1 (en) Integration in computer analytics system
Shrivastava et al. Extracting knowledge from user access logs
Rudra et al. Linux-like Socket Statistics Utility for ns-3
Ran et al. Research on Data Acquisition Strategy and Its Application in Web Usage Mining

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160803