CN105868412B - A multi-threaded data crawling method based on a B2B platform

Info

Publication number: CN105868412B
Application number: CN201610272886.4A
Authority: CN (China)
Prior art keywords: thread, data, content, file, platform
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN105868412A (en)
Inventor: 徐飞
Current Assignee: Focus Technology Co Ltd (the listed assignees may be inaccurate)
Original Assignee: Focus Technology Co Ltd
Priority date: 2016-04-28 (the priority date is an assumption and is not a legal conclusion)
Filing date: 2016-04-28
Publication of CN105868412A: 2016-08-17
Publication of CN105868412B (grant): 2019-05-03

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation

Abstract

The invention discloses a multi-threaded data crawling method based on a B2B platform, comprising: 1) taking the home page and multi-level directory structure of the B2B platform as the target object and analyzing the web page source files; 2) analyzing the required target rules in a URL handler; 3) obtaining data in a specified format from the transfer protocol through an HTTP parser; 4) assigning a new thread to each URL request and processing the requests concurrently in multi-threaded mode; 5) managing the rules that lead from each major product category to its subcategories through a category manager; 6) setting a thread timeout; 7) saving all extracted product data to a database in the fixed data format. The method provided by the invention is markedly effective for real-time concurrent collection of large data volumes and for multi-threaded data crawling.

Description

A multi-threaded data crawling method based on a B2B platform
Technical field
The present invention relates to a multi-threaded data crawling method based on a B2B platform.
Background art
As e-commerce has developed, understanding competitors has come to require many kinds of information, including products, sales volume, number of users and so on; such data can only be obtained from the platforms by technical means.
When a competitor's website adds new product information or updates existing product information, the multi-threaded data crawling method of our B2B platform collects all of the competitor's data very effectively.
Existing domestic data crawling methods, particularly for collection from B2B platforms under large-scale concurrency, easily run into problems or cannot guarantee real-time performance when both real-time operation and large data volumes are required. For example:
Chinese patent CN201210141520.5 describes a data capture system comprising a hook loading module, capture hook modules and a configuration file generation module. The hook loading module generates a set number of capture hook modules according to the number of processes and assigns them to the processes to be captured. Each capture hook module monitors the transfer of business data in its corresponding process and captures that business data. The proposed system can conveniently and quickly capture data from other business systems with a C/S architecture and supply it to other business systems for entry. It can capture control data in Windows windows in a C/S architecture, capture window data from other business systems, write the captured data to a file in a configurable format, and provide it as data input to other systems. This method performs data capture within a C/S architecture and cannot be applied to web pages or B2B websites.
Chinese patent CN201510378181.6 proposes a method comprising: receiving data capture parameters set separately for the target data each platform requires; executing the corresponding data capture rules according to the parameters set for each platform, and capturing from the Internet the target data that platform requires; displaying the captured target data; and receiving a screening operation on the displayed target data and publishing the screened target data to the page section of the platform. This simplifies the steps operations staff take to obtain target data and reduces their workload, while greatly improving the quality and quantity of published article information, so that the number of high-quality articles each operator can publish per day increases greatly. This method mainly solves the operational problem of screening target data and processes proprietary target data; it cannot be implemented on B2B websites.
Summary of the invention
Purpose of the invention: to solve the problem that multi-threaded crawling of B2B websites must obtain content that is loaded through calls or hidden, the present invention provides a multi-threaded data crawling method based on a B2B platform. It targets B2B platforms with multi-level calls, crawling data whose content is obtained through nested calls and capturing hidden content. Data is hard to obtain from the individual script files, including styles and nested calls, and the method effectively solves this problem.
The technical solution of the invention is a multi-threaded data crawling method based on a B2B platform, characterized by comprising the following steps:
(1) Taking the home page and multi-level directory structure of the B2B platform as the target object, analyze the web page source files. The product data of the B2B platform data source is collected in real time as follows:
Useless information is removed from the source file; the removal is performed by recursive calls through our tag library;
Valid URLs are extracted from the source file and passed to the next step;
(2) The required target rules are analyzed in the URL handler and a further URL request is made to obtain the next layer of source code. The source code is placed in a buffer, and the buffer content is then passed to the next target task for processing. Unwanted content, such as advertising information, copyright information and tag markup, is filtered out of this source code so as to discard the noise and keep the genuine content. The pagination rule of the source code is found, and the source code is split accordingly.
(3) Through the HTTP parser, data in a specified format (key-value pair data, wrapped array-structure data, and structured data delimited by characteristic characters) is obtained from the transfer protocol and extracted as the required target data result; the result is then screened by format and invalid characters are removed. Multiple threads issuing HTTP requests are started, each URL request asks for only one part of the resource file in the same format, and the parts downloaded by the threads are merged. In many cases this reduces the number of requests sent, and in many cases the whole HTML content need not be sent. This reduces both the number of network round trips and the bandwidth the application uses.
(4) Thread processing: a new thread is assigned to each URL request, and the requests are processed concurrently in multi-threaded mode.
Multi-threaded calls are carried out by a thread manager, and a thread that fails is suspended automatically. Each request needs a separate thread to complete it. In a thread pool the number of threads is usually fixed, and the total number of threads does not exceed the number the pool can hold; if the server handles these requests without a thread pool, the total number of request threads is no more than 50000;
(5) Category manager: the rules that lead from each major product category to its subcategories are managed by a category manager. Once the data matches a rule, the source code of the major product category is obtained first, and the source files of the product subcategories are then fetched by recursive calls.
(6) A thread timeout is set. If, within the timeout interval, it cannot be detected that a thread has executed successfully, the thread's processing flag is set to failed, and the thread is restarted automatically when the system is idle. Based on this flag the thread pool either handles the task directly or adds workers to handle it and places it in the pending queue; other thread pools can put tasks directly into the pending queue, where they wait for a worker thread to take them out and execute them.
(7) All extracted product data is saved to the database in the fixed format, that is, key-value pair data, wrapped array-structure data, and structured data delimited by characteristic characters.
Beneficial effects: the present invention provides a multi-threaded data crawling method based on a B2B platform. It targets B2B platforms with multi-level calls, crawling data whose content is obtained through nested calls and capturing hidden content, and it effectively solves the difficulty of obtaining data from the individual script files, including styles and nested calls.
Brief description of the drawings
Fig. 1 is a flow chart of the implementation of the method of the present invention.
Detailed description of the embodiments
The multi-threaded data crawling method based on a B2B platform according to the present invention comprises the following steps:
(1) Taking the home page and multi-level directory structure of the B2B platform as the target object, analyze the web page source files. The data of the data source is collected in real time as follows:
Useless information is removed from the source file; the removal is performed by recursive calls through the tag library;
The tag library is a set of commonly used HTML tag strings held in an in-memory array. The source is matched against it by recursive calls, and the useless information is finally removed.
Valid URLs are extracted from the source file and passed to the next step;
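As a rough illustration of step (1), the sketch below removes useless blocks by recursing through a small tag library and then collects the URLs that remain. It is a minimal sketch under stated assumptions, not the patented implementation: the contents of `TAG_LIBRARY`, the function and class names, and the decision to keep only http(s) links are all choices made for this example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical tag library: tags whose whole block is treated as useless.
TAG_LIBRARY = ["script", "style", "noscript", "iframe"]


def strip_tags(source, tags):
    """Recursively remove every <tag>...</tag> block named in the tag library."""
    if not tags:
        return source
    head, rest = tags[0], tags[1:]
    pattern = re.compile(r"<{0}\b[^>]*>.*?</{0}\s*>".format(head), re.I | re.S)
    return strip_tags(pattern.sub("", source), rest)


class LinkCollector(HTMLParser):
    """Collects absolute http(s) URLs from <a href="..."> attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            url = urljoin(self.base_url, href)
            # Skip javascript:, mailto: and similar links, and drop duplicates.
            if url.startswith(("http://", "https://")) and url not in self.urls:
                self.urls.append(url)


def extract_urls(source, base_url):
    """Clean the source with the tag library, then return the valid URLs."""
    collector = LinkCollector(base_url)
    collector.feed(strip_tags(source, TAG_LIBRARY))
    return collector.urls
```

The URLs returned by `extract_urls` are what would be handed to the URL handler in step (2).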
(2) The required target rules are analyzed in the URL handler and a further URL request is made to obtain the next layer of source code. The content is placed in a buffer, and the buffer content is passed to the next target task for processing;
Unwanted content, such as advertising information, copyright information and tag markup, is filtered out of this code so as to discard the noise and keep the genuine content. The pagination rule is found, and the content is split accordingly.
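The filtering and pagination part of step (2) might look like the following. This is only a sketch: the patent does not say how advertising or copyright blocks are recognised or what the pagination rule looks like, so the `class` names in `NOISE_PATTERN` and the `?page=N` link form in `PAGE_PATTERN` are hypothetical. The regular expression also does not handle nested `<div>` elements.

```python
import re

# Hypothetical noise rule: <div> blocks whose class starts with "ad" or "copyright".
NOISE_PATTERN = re.compile(
    r'<div[^>]*class="(?:ad|copyright)\b[^"]*"[^>]*>.*?</div>', re.I | re.S
)
# Hypothetical pagination rule: links of the form ...?page=N or ...&page=N
PAGE_PATTERN = re.compile(r'href="([^"]*?[?&]page=)(\d+)"')


def filter_noise(source):
    """Drop advertising and copyright blocks from the buffered source."""
    return NOISE_PATTERN.sub("", source)


def pagination_urls(source):
    """Infer the paging rule and return one URL per page, from 1 to the last page seen."""
    matches = PAGE_PATTERN.findall(source)
    if not matches:
        return []
    prefix = matches[0][0]
    last_page = max(int(number) for _, number in matches)
    return [f"{prefix}{n}" for n in range(1, last_page + 1)]
```

Each URL returned by `pagination_urls` can then be requested separately, which is one way to "split" the content along its paging rule.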
(3) Through the HTTP parser, data in a specified format, namely key-value pair data, wrapped array-structure data, and structured data delimited by characteristic characters, is obtained from the transfer protocol and extracted as the required target data result; the result is then screened by format. Multiple threads issuing HTTP requests are started, each HTTP request asks for only one part of the resource file, and the parts downloaded by the threads are merged. In many cases this reduces the number of requests sent, and in many cases the complete response need not be sent. This reduces both the number of network round trips and the bandwidth the application uses.
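The "each request asks for only one part of the resource file, then merge" idea in step (3) can be sketched with byte ranges. The patent does not name a mechanism; the example assumes one that resembles HTTP range requests, where a real `fetch_range` would send a `Range: bytes=start-end` header. Here `fetch_range` is a hypothetical callable supplied by the caller, so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total_size, parts):
    """Split [0, total_size) into at most `parts` inclusive (start, end) byte ranges."""
    if total_size <= 0:
        return []
    chunk = -(-total_size // parts)  # ceiling division
    return [
        (start, min(start + chunk, total_size) - 1)
        for start in range(0, total_size, chunk)
    ]


def download_in_parts(fetch_range, total_size, parts=4):
    """Fetch every range on its own thread and merge the pieces in order."""
    ranges = split_ranges(total_size, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        # map() yields results in the order of `ranges`, so joining is safe.
        pieces = pool.map(lambda r: fetch_range(*r), ranges)
        return b"".join(pieces)
```

Because `pool.map` preserves input order, the merged bytes match the original resource even if the threads finish out of order.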
(4) Thread processing: a new thread is assigned to each URL request, and the requests are processed concurrently in multi-threaded mode.
Multi-threaded calls are carried out by a thread manager, and a thread that fails is suspended automatically. Suppose a server handles 50000 requests a day and each request needs a separate thread to complete it. In a thread pool the number of threads is usually fixed, and the total number of threads does not exceed the number of threads in the pool; if the server handles these requests without a thread pool, the total number of threads is 50000. A typical thread pool is far smaller than 50000. With a thread pool, no time is wasted creating 50000 threads while the requests are being handled, which improves efficiency.
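The contrast drawn in step (4), between one thread per request and a fixed pool, can be sketched with Python's standard `concurrent.futures.ThreadPoolExecutor`, used here as a stand-in for the patent's thread manager. `handle_request`, `process_requests` and the pool size of 8 are assumptions for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def handle_request(url):
    """Stand-in for fetching and parsing one URL; returns the worker thread's name."""
    return threading.current_thread().name


def process_requests(urls, pool_size=8):
    """Run every request on a pool of at most `pool_size` reusable threads."""
    with ThreadPoolExecutor(
        max_workers=pool_size, thread_name_prefix="crawler"
    ) as pool:
        return list(pool.map(handle_request, urls))
```

However many URLs are submitted, the set of distinct thread names in the result never exceeds `pool_size`, which is the point of the 50000-request example: the threads are created once and reused.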
(5) Category manager: the rules that lead from each major category to its subcategories are managed by a category manager. Once the data matches a rule, the source code of the major category is obtained first, and the source files of the subcategories are then fetched by recursive calls.
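A minimal sketch of the recursion in step (5) follows. It assumes the category rules can be represented as a mapping from each category to its subcategories; that representation, the name `crawl_category` and the `fetch_source` callable are hypothetical.

```python
def crawl_category(name, rules, fetch_source, results=None):
    """Fetch a category's source first, then recurse into its subcategories."""
    if results is None:
        results = {}
    if name in results:  # already visited; also guards against cyclic rules
        return results
    results[name] = fetch_source(name)
    for child in rules.get(name, []):
        crawl_category(child, rules, fetch_source, results)
    return results
```

For example, with `rules = {"Machinery": ["Pumps", "Valves"], "Pumps": ["Water Pumps"]}`, calling `crawl_category("Machinery", rules, fetch_source)` fetches the major category before any of its subcategories and descends depth-first.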
(6) A thread timeout is set. If, within the timeout interval, it cannot be detected that a thread has executed successfully, the thread's processing flag is set to failed, and the thread is restarted automatically when the system is idle. Based on this flag the thread pool either handles the task directly or adds workers to handle it and places it in the pending queue; other thread pools can put tasks directly into the pending queue, where they wait for a worker thread to take them out and execute them.
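The timeout handling in step (6) could be approximated as below. This is a sketch with stated simplifications: "system idle" is taken to mean "the first pass has finished", the failed flag is modelled as membership in a `failed` list, and all function names are hypothetical. Note also that Python cannot forcibly stop a running thread, so a timed-out task keeps running until it returns.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def make_task(delay, value):
    """Stand-in for a crawl task that takes `delay` seconds and returns `value`."""
    def task():
        time.sleep(delay)
        return value
    return task


def run_with_timeout(tasks, timeout, pool_size=4):
    """Run named tasks; return (results, failed), where failed lists tasks to retry."""
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = {name: pool.submit(func) for name, func in tasks.items()}
        for name, future in futures.items():
            try:
                # result() raises if the task has not finished within `timeout`.
                results[name] = future.result(timeout=timeout)
            except Exception:
                failed.append(name)  # timeout or error: flag the task as failed
    return results, failed


def crawl_with_retry(tasks, timeout, retry_timeout):
    """First pass with a short timeout, then restart the failed tasks once."""
    results, failed = run_with_timeout(tasks, timeout)
    # The first pass is over, so the pool is idle: restart what failed.
    retried, still_failed = run_with_timeout(
        {name: tasks[name] for name in failed}, retry_timeout
    )
    results.update(retried)
    return results, still_failed
```

A production crawler would more likely push failed tasks onto a shared pending queue for worker threads to pick up, as the step describes; the two-pass structure here is only the simplest form of that idea.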
(7) All extracted content is saved to the database in the fixed format.
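Step (7) does not name a database or schema. The sketch below assumes SQLite through Python's standard `sqlite3` module, a hypothetical `product` table, and extracted records shaped as dictionaries of key-value pairs, with the remaining attributes serialised as JSON.

```python
import json
import sqlite3


def save_products(conn, products):
    """Store each extracted product (a dict of key-value pairs) as one row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS product ("
        "url TEXT PRIMARY KEY, name TEXT, attributes TEXT)"
    )
    rows = [
        (p["url"], p["name"], json.dumps(p.get("attributes", {}), ensure_ascii=False))
        for p in products
    ]
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany("INSERT OR REPLACE INTO product VALUES (?, ?, ?)", rows)
    return len(rows)
```

Using the URL as the primary key with `INSERT OR REPLACE` means that re-crawling a product updates its row rather than duplicating it.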
The HTTP parsing described above comprises the following steps: (1) parse the HTML (hypertext markup language) source file, including the js script files and css files referenced in the HTML file, and process them; (2) process the source through the source file parser, parsing each format accordingly; (3) obtain the hidden content returned by js scripts through the HTTP packet handler (HTTP is the hypertext transfer protocol, mainly used to transfer hypertext from a WWW server to the local browser); (4) match and process the hidden content; (5) process and integrate (merge) the final data; (6) the thread manager processes new tasks concurrently.
The parsing applies several kinds of rules, including regular expressions and irregular array structures; matching is performed in key-value pair fashion, and the matching result is output.
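Rule-based parsing into key-value pairs, as described above, might be sketched like this. The three rules in `RULES` and the HTML fragments they target are hypothetical; a real rule set would be written per platform.

```python
import re

# Hypothetical rules: each maps an output key to a regular expression with one group.
RULES = {
    "name": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S),
    "price": re.compile(r'data-price="([\d.]+)"'),
    "min_order": re.compile(r"Min\. Order:\s*(\d+)"),
}


def parse_by_rules(source, rules=RULES):
    """Apply every rule and return the matches as key-value pairs."""
    result = {}
    for key, pattern in rules.items():
        match = pattern.search(source)
        if match:
            result[key] = match.group(1).strip()
    return result
```

Rules that do not match are simply absent from the result, so downstream code can tell missing fields from empty ones.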
The hidden content is data obtained by accessing a group of URLs. The data required by the current rule is traversed out of that data, the hidden content is parsed, and the matching result is output.
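One plausible reading of "hidden content" is data that a page's scripts load from separate URLs, often as JSON or JSONP. The patent does not specify the response format, so the sketch below is an assumption: `parse_hidden` unwraps a JSONP callback if one is present, and `collect_hidden` keeps only the keys the current rule asks for. The `fetch` callable and all names are hypothetical.

```python
import json
import re

# Matches a JSONP wrapper such as callback({...}); and captures the payload.
JSONP_PATTERN = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.S)


def parse_hidden(payload):
    """Decode a JSON or JSONP response body into Python data."""
    match = JSONP_PATTERN.match(payload)
    return json.loads(match.group(1) if match else payload)


def collect_hidden(urls, fetch, wanted_keys):
    """Fetch each URL and keep only the keys required by the current rule."""
    records = []
    for url in urls:
        data = parse_hidden(fetch(url))
        items = data if isinstance(data, list) else [data]
        for item in items:
            records.append({k: item[k] for k in wanted_keys if k in item})
    return records
```

The records produced here have the same key-value shape as the results of rule-based parsing, so both can be merged in the final data step.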
For the final data, the matching results held in memory are extracted as key-value pairs, packed into an array, and stored in the database as rows.
Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit it. Those of ordinary skill in the art to which the invention belongs may make various modifications and variations without departing from its spirit and scope. The scope of protection of the present invention is therefore defined by the claims.

Claims (1)

1. A multi-threaded data crawling method based on a B2B platform, characterized by comprising the following steps:
(1) taking the home page and multi-level directory structure of the B2B platform as the target object, analyzing the web page source files, wherein the product data of the B2B platform data source is collected in real time as follows: useless information is removed from the source file, the removal being performed by recursive calls through a tag library; valid URLs are extracted from the source file and passed to the next step;
(2) analyzing the required target rules in a URL handler and making a further URL request to obtain the next layer of source code, placing the source code in a buffer, and then passing the buffer content to the next target task for processing; filtering unwanted content, including advertising information, copyright information and tag markup, out of the source code so as to discard the noise and keep the genuine content; finding the pagination rule of the source code and splitting the source code accordingly;
(3) obtaining, through an HTTP parser, data in a specified format from the transfer protocol, the specified format being key-value pair data, wrapped array-structure data, and structured data delimited by characteristic characters; extracting the data as the required target data result, then screening the result by format and removing invalid characters; starting multiple threads that issue HTTP requests, each URL request asking for only one part of the resource file, and merging the parts downloaded by the threads;
(4) thread processing: assigning a new thread to each URL request and processing the requests concurrently in multi-threaded mode;
wherein multi-threaded calls are carried out by a thread manager and a thread that fails is suspended automatically; each request needs a separate thread to complete it; in a thread pool the number of threads is fixed and the total number of threads does not exceed the number the pool can hold; if the server handles these requests without a thread pool, the total number of request threads is no more than 50000;
(5) category manager: managing, through a category manager, the rules that lead from each major product category to its subcategories, wherein once the data matches a rule the source code of the major product category is obtained first and the source files of the product subcategories are then fetched by recursive calls;
(6) setting a thread timeout, wherein if, within the timeout interval, it cannot be detected that a thread has executed successfully, the thread's processing flag is set to failed, and the thread is restarted automatically when the system is idle; based on this flag the thread pool either handles the task directly or adds workers to handle it and places it in the pending queue; other thread pools can put tasks directly into the pending queue, where they wait for a worker thread to take them out and execute them;
(7) saving all extracted product data to a database in the specified data format;
wherein the HTTP parsing comprises the steps of: (1) parsing the HTML (hypertext markup language) source file, including the js script files and css files in the HTML file, and processing them; (2) processing the source through a source file parser, parsing each format accordingly; (3) obtaining the hidden content returned by js scripts through an HTTP packet handler; (4) matching and processing the hidden content; (5) processing and integrating the final data; (6) the thread manager processing new tasks concurrently.
CN201610272886.4A 2016-04-28 2016-04-28 A multi-threaded data crawling method based on a B2B platform Active CN105868412B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610272886.4A CN105868412B (en) 2016-04-28 2016-04-28 A multi-threaded data crawling method based on a B2B platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610272886.4A CN105868412B (en) 2016-04-28 2016-04-28 A multi-threaded data crawling method based on a B2B platform

Publications (2)

Publication Number Publication Date
CN105868412A (en) 2016-08-17
CN105868412B (en) 2019-05-03

Family

ID=56629472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610272886.4A Active CN105868412B (en) 2016-04-28 2016-04-28 A multi-threaded data crawling method based on a B2B platform

Country Status (1)

Country Link
CN (1) CN105868412B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107577525B (en) * 2017-08-22 2020-11-17 努比亚技术有限公司 Method and device for creating concurrent threads and computer-readable storage medium
CN109101440B (en) * 2018-08-01 2023-08-04 浪潮软件集团有限公司 Method for processing trace data concurrent request based on JVM (Java virtual machine) cache
CN109408695A (en) * 2018-09-27 2019-03-01 苏州创旅天下信息技术有限公司 Competing product data grab method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968495A (en) * 2012-11-29 2013-03-13 河海大学 Vertical search engine and method for searching contrast association shopping information
KR101374533B1 (en) * 2013-04-17 2014-03-14 주식회사 엔써티 High performance replication system and backup system for mass storage data, method of the same
CN104050037A (en) * 2014-06-13 2014-09-17 淮阴工学院 Implementation method for directional crawler based on assigned e-commerce website
CN104376063A (en) * 2014-11-11 2015-02-25 南京邮电大学 Multithreading web crawler method based on sort management and real-time information updating system
CN104583949A (en) * 2012-08-16 2015-04-29 高通股份有限公司 Pre-processing of scripts in web browsers

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104583949A (en) * 2012-08-16 2015-04-29 高通股份有限公司 Pre-processing of scripts in web browsers
CN102968495A (en) * 2012-11-29 2013-03-13 河海大学 Vertical search engine and method for searching contrast association shopping information
KR101374533B1 (en) * 2013-04-17 2014-03-14 주식회사 엔써티 High performance replication system and backup system for mass storage data, method of the same
CN104050037A (en) * 2014-06-13 2014-09-17 淮阴工学院 Implementation method for directional crawler based on assigned e-commerce website
CN104376063A (en) * 2014-11-11 2015-02-25 南京邮电大学 Multithreading web crawler method based on sort management and real-time information updating system

Also Published As

Publication number Publication date
CN105868412A (en) 2016-08-17

Similar Documents

Publication Publication Date Title
CN103645939B (en) A kind of method and system of picture crawl
CN105868412B (en) A multi-threaded data crawling method based on a B2B platform
CN106095585B (en) Task requests processing method, device and enterprise information system
US7318056B2 (en) System and method for performing click stream analysis
CN107590188A (en) A kind of reptile crawling method and its management system for automating vertical subdivision field
CN106844018A (en) A kind of task processing method, apparatus and system
CN106371975B (en) A kind of O&M automation method for early warning and system
CN107729214A (en) A kind of visual distributed system monitors O&M method and device in real time
CN109325161A (en) Public sentiment data grasping means, device, equipment and storage medium
CN105843893B (en) Monitoring method and device based on the software update information that Web information extracts
CN105302815B (en) The filter method and device of the uniform resource position mark URL of webpage
CN110457333B (en) Data real-time updating method and device and computer readable storage medium
CN112365157A (en) Intelligent dispatching method, device, equipment and storage medium
CN110417873A (en) A kind of network information extraction system for realizing record webpage interactive operation
CN103559097B (en) The method of interprocess communication, device and browser in a kind of browser
CN107341685A (en) Data analysing method and device
CN108650546A (en) Barrage processing method, computer readable storage medium and electronic equipment
CN111814192A (en) Training sample generation method and device and sensitive information detection method and device
CN108519908A (en) A kind of task dynamic management approach and device
CN108182595A (en) A kind of formulation migration efficiency method and device
CN109783330A (en) Log processing method, display methods and relevant apparatus, system
CN110716774A (en) Data driving method, system and storage medium for brain of financial business data
CN102055620B (en) Method and system for monitoring user experience
CN109408763A (en) The method and system that the resume of a kind of pair of different templates is managed
CN109558887A (en) A kind of method and apparatus of predictive behavior

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant