CN105868412B - A multi-threaded data capture method based on a B2B platform - Google Patents
A multi-threaded data capture method based on a B2B platform
- Publication number
- CN105868412B (application number CN201610272886.4A)
- Authority
- CN
- China
- Prior art keywords
- thread
- data
- content
- file
- platform
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Abstract
The invention discloses a multi-threaded data capture method based on a B2B platform, comprising: 1) taking the home page and multi-level directory structure of the B2B platform as the target object and analyzing the web page source files; 2) analyzing the required target rules in a URL handler; 3) obtaining data in a certain format from the tunnel protocol through an HTTP parser; 4) assigning a new thread to each URL request and processing the requests concurrently in multi-threaded mode; 5) managing, through a class manager, the rules from each major commodity category to its groups; 6) setting a thread timeout; 7) saving all extracted commodity data to a database in the fixed data format. The multi-threaded data capture method provided by the invention is markedly effective for real-time concurrent acquisition of big data and for multi-threaded data capture.
Description
Technical field
The present invention relates to a multi-threaded data capture method based on a B2B platform.
Background art
E-commerce has developed to the point where understanding competitors requires many kinds of information, including products, sales volume, number of users and so on; such data can only be obtained from the platform by technical means.
When a competitor's website publishes new product information or updates existing product information, the multi-threaded data capture method of our B2B platform can acquire all of the competitor's data very efficiently.
Existing domestic data capture methods, especially those that collect from B2B platforms, tend to run into many problems or cannot guarantee real-time performance under concurrent, real-time, large-volume conditions. For example:
Chinese patent CN201210141520.5 describes a data capture system comprising a hook loading module, a capture hook module and a configuration file generation module. The hook loading module generates a set number of capture hook modules according to the number of processes and assigns them to the processes to be captured. Each capture hook module monitors the transfer of business data in its corresponding process and captures that business data. The proposed system can conveniently and quickly capture data from other business systems with a C/S structure and supply it to other business systems for entry. It can capture control data in WINDOWS windows under the C/S architecture, capture window data from other business systems, write the captured data to a file in a configurable format, and provide the file as data input to other systems. Because this method captures data within a C/S architecture, it is not applicable to web pages or B2B websites.
Chinese patent CN201510378181.6 proposes a method comprising: receiving data capture parameters set separately for the target data required by each platform; executing the corresponding data capture rules according to the parameters set for each platform, and capturing the target data required by that platform from the Internet; displaying the captured target data; and receiving a screening operation on the displayed target data and publishing the screened target data to the page section of the platform. This simplifies the steps operations staff take to obtain target data, reduces their workload, and greatly improves the quality and quantity of published article information, so that the number of high-quality articles each operator can publish per day increases substantially. The method mainly addresses the operational problem of screening target data and handles only proprietary target data, so it cannot be applied to B2B websites.
Summary of the invention
Object of the invention: to solve the problems that B2B websites pose for multi-threaded capture and for retrieving hidden content that is loaded through calls, the present invention provides a multi-threaded data capture method based on a B2B platform. It captures content that a B2B platform delivers through multi-level and nested calls, and it captures hidden content. The method targets each script file, including styles and nested calls, from which data is otherwise difficult to obtain, and effectively solves that problem.
The technical solution of the invention is a multi-threaded data capture method based on a B2B platform, characterized by comprising the following steps:
(1) Taking the home page and multi-level directory structure of the B2B platform as the target object, the web page source files are analyzed. The commodity data of the B2B platform data source is acquired in real time as follows:
Useless information is removed from the source file by recursive calls to our tag library;
Valid URLs are extracted from the source file and passed to the next step;
(2) The required target rules are analyzed in the URL handler, and a further URL request is made to obtain the next layer of source code content. The source code content is placed in a buffer, and the buffer contents are then passed to the next target task for processing. Unwanted content in part of the source code, such as advertising information, copyright information and tag markup, is filtered out so as to discard the false and keep the true. The rule by which the source code content is paginated is found, and the content is separated accordingly.
(3) Through the HTTP parser, data in a certain format (key-value pair data, packed array-structure data, and structured data with characteristic characters) is obtained from the tunnel protocol and extracted as the target data result we require; the result is then screened by format to remove invalid characters. Multiple threads that issue HTTP requests are started; each URL request asks for only one part of the resource file in the same format, and the files downloaded by the threads are merged. In many cases this reduces the number of requests sent, and in many cases the entire HTML content need not be sent. It reduces the number of network round trips and also the bandwidth used by the network application.
(4) Thread processing: a new thread is assigned to each URL request, and the requests are processed concurrently in multi-threaded mode.
Multi-threaded calls are carried out by a thread manager, and a thread that fails is suspended automatically. Each request needs its own thread to complete. In a thread pool the number of threads is usually fixed, and the total number of threads does not exceed the number of threads the pool can hold; when the server handles these requests without a thread pool, the total number of request threads is no more than 50000;
(5) Class manager: the class manager manages the rules from each major commodity category to its groups. Once the data matches a rule, the source code of the major commodity category is obtained first, and the source code files of the subcategory commodities are then fetched by recursive calls.
(6) A thread timeout is set. If it cannot be detected within the timeout interval that a thread has executed successfully, the thread's processing flag is set to failed, and the thread is automatically restarted when the system is idle. Based on this flag, the thread pool either processes the task directly or increases the number of workers to process it, and places it in the pending queue; other thread pools can put tasks directly into the pending queue, where they wait for a worker thread to take them out and execute them.
(7) All extracted commodity data is saved to the database in the fixed format, that is, key-value pair data, packed array-structure data, and structured data with characteristic characters.
Beneficial effects: the present invention provides a multi-threaded data capture method based on a B2B platform that captures content delivered through the platform's multi-level and nested calls and captures hidden content. It targets each script file, including styles and nested calls, from which data is otherwise difficult to obtain, and effectively solves that problem.
Brief description of the drawings
Fig. 1 is a flow chart of the implementation of the method of the present invention.
Detailed description of the embodiments
The multi-threaded data capture method based on a B2B platform according to the present invention comprises the following steps:
(1) Taking the home page and multi-level directory structure of the B2B platform as the target object, the web page source files are analyzed. The data of the data source is acquired in real time as follows:
Useless information is removed from the source file by recursive calls to a tag library;
The tag library is a set of commonly used HTML tag character types stored in an in-memory array; the source is matched against it by recursive calls until the useless information has been removed.
Valid URLs are extracted from the source file and passed to the next step;
(2) The target rules we need are analyzed in the URL handler, and a further URL request is made to obtain the next layer of source code content. The content is placed in a buffer, and we pass the buffer contents to the next target task for processing;
Unwanted content in part of the code, such as advertising information, copyright information and tag markup, is filtered out so as to discard the false and keep the true. The rule by which the content is paginated is found, and the content is separated accordingly.
(3) Through the HTTP parser, data in a certain format, namely key-value pair data, packed array-structure data, and structured data with characteristic characters, is obtained from the tunnel protocol and extracted as the target data result we require; the result is then screened by format. Multiple threads that issue HTTP requests are started; each HTTP request asks for only a part of the resource file, and the files downloaded by the threads are merged. In many cases this reduces the number of requests sent, and in many cases the complete response need not be sent. It reduces the number of network round trips and also the bandwidth used by the network application.
(4) Thread processing: a new thread is assigned to each URL request, and the requests are processed concurrently in multi-threaded mode.
Multi-threaded calls are carried out by a thread manager, and a thread that fails is suspended automatically. Suppose a server handles 50000 requests a day and each request needs its own thread to complete. In a thread pool the number of threads is usually fixed, and the total number of threads does not exceed the number of threads in the pool; if the server handles these requests without a thread pool, the total number of threads is 50000. A typical thread pool is far smaller than 50000. Using a thread pool therefore avoids wasting time creating 50000 threads while handling requests, which improves efficiency.
(5) Class manager: the class manager manages the rules from each major commodity category to its groups. Once the data matches a rule, the source code of the major category is obtained first, and the source code files of the subcategories are then fetched by recursive calls.
(6) A thread timeout is set. If it cannot be detected within the timeout interval that a thread has executed successfully, the thread's processing flag is set to failed, and the thread is automatically restarted when the system is idle. Based on this flag, the thread pool either processes the task directly or increases the number of workers to process it, and places it in the pending queue; other thread pools can put tasks directly into the pending queue, where they wait for a worker thread to take them out and execute them.
(7) All extracted content is saved to the database in the fixed format.
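Steps (4) and (6) can be illustrated with a small sketch: one task per URL runs on a fixed-size thread pool, and a URL is flagged as failed when its task raises or does not finish within the timeout, so that it can be resubmitted later. This is only a hedged illustration in Python, not the patented implementation; `crawl_all`, `fetch`, `max_workers` and `timeout_s` are hypothetical names, and the pool size and timeout values are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def crawl_all(urls, fetch, max_workers=8, timeout_s=5.0):
    """Run fetch(url) for every URL on a fixed-size thread pool.

    Returns (results, failed). results maps url -> fetched value.
    failed maps url -> "timeout" or "error" and plays the role of the
    "processing flag" described in step (6).
    """
    results = {}
    failed = {}
    # The pool size is fixed, so 50000 URLs never mean 50000 threads.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {url: pool.submit(fetch, url) for url in urls}
        for url, future in futures.items():
            try:
                # Wait at most timeout_s for this task's result.
                results[url] = future.result(timeout=timeout_s)
            except FutureTimeout:
                # Note: the worker thread is not killed; it is only
                # flagged here. Leaving the "with" block still waits
                # for it to finish.
                failed[url] = "timeout"
            except Exception:
                failed[url] = "error"
    return results, failed
```

URLs left in `failed` can be passed to `crawl_all` again when the system is idle, which roughly corresponds to restarting the failed thread.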
The HTTP parsing mentioned above proceeds in the following steps: (1) parse the HTML (hypertext markup language) source file, including the js script files and css files in the HTML file, and process them; (2) process the result with the source file parser, parsing each format accordingly; (3) obtain the hidden content returned by js scripts through the HTTP packet handler (HTTP is the hypertext transfer protocol, mainly used to transfer hypertext from a WWW server to the local browser); (4) match and process the hidden content; (5) integrate (merge) the final data; (6) have the thread manager process new tasks concurrently.
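The first parsing step, reading an HTML source file and locating its js script files, css files and outgoing links, might look like the sketch below. It is a minimal example using Python's standard `html.parser` module; the class `ResourceCollector`, the function `collect_resources` and the example URLs are hypothetical and are not taken from the patent.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect script, stylesheet and link URLs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.scripts = []
        self.stylesheets = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower()
        if tag == "script" and attr.get("src"):
            # External js file that may return hidden content.
            self.scripts.append(urljoin(self.base_url, attr["src"]))
        elif tag == "link" and rel == "stylesheet" and attr.get("href"):
            self.stylesheets.append(urljoin(self.base_url, attr["href"]))
        elif tag == "a" and attr.get("href"):
            # Candidate URL for the next crawl level.
            self.links.append(urljoin(self.base_url, attr["href"]))


def collect_resources(html, base_url):
    """Parse html and return the populated collector."""
    parser = ResourceCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser
```

Relative references are resolved against the page URL with `urljoin`, so the collected URLs can be handed directly to the next request.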
The parsing applies several rules, including regular expressions and irregular array structures, matches the content in key-value pair fashion, and outputs the matching result.
The hidden content is data obtained by accessing a group of URLs. The data required by the current rule is traversed out of that data, the hidden content is parsed, and the matching result is output.
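As an illustration of traversing hidden content for the required data, the sketch below assumes that the responses returned by js scripts are JSON or JSONP, which the patent does not specify. `decode_payload` and `find_values` are hypothetical helper names.

```python
import json
import re

# Matches a JSONP wrapper such as: callback({...});
_JSONP = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.S)


def decode_payload(body):
    """Decode a JSON body, unwrapping a JSONP callback if present."""
    match = _JSONP.match(body)
    return json.loads(match.group(1) if match else body)


def find_values(node, wanted):
    """Yield (key, value) for every wanted key in a nested structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in wanted:
                yield key, value
            yield from find_values(value, wanted)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, wanted)
```

The recursive walk means the caller only names the keys it needs and does not have to know how deeply the site nests them.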
For the final data, result data is extracted in key-value pair form from each matching result held in memory, packed into an array, and stored row by row in the database.
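Packing key-value results into rows and storing them could be sketched as below. SQLite, the table name `goods` and the columns in `FIELDS` are assumptions made for the example; the patent does not name a database or a schema.

```python
import sqlite3

# Hypothetical column order used to pack each record into a row.
FIELDS = ("name", "price", "url")


def save_records(conn, records):
    """Pack key-value records into rows and insert them in one batch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS goods (name TEXT, price TEXT, url TEXT)"
    )
    # Missing keys become NULL so every row has the same shape.
    rows = [tuple(rec.get(field) for field in FIELDS) for rec in records]
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT INTO goods (name, price, url) VALUES (?, ?, ?)", rows
        )
    return len(rows)
```

A single batched insert inside one transaction keeps the database write from becoming the bottleneck when many worker threads deliver results.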
Although the present invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. A person of ordinary skill in the art to which the invention belongs may make various modifications and variations without departing from the spirit and scope of the invention. The scope of protection of the invention is therefore defined by the claims.
Claims (1)
1. A multi-threaded data capture method based on a B2B platform, characterized by comprising the following steps:
(1) taking the home page and multi-level directory structure of the B2B platform as the target object, analyzing the web page source files, wherein the commodity data of the B2B platform data source is acquired in real time as follows: useless information is removed from the source file by recursive calls to a tag library; valid URLs are extracted from the source file and passed to the next step;
(2) analyzing the required target rules in the URL handler and making a further URL request to obtain the next layer of source code content; placing the source code content in a buffer and then passing the buffer contents to the next target task for processing; filtering unwanted content out of part of the source code content, including advertising information, copyright information and tag markup, so as to discard the false and keep the true; finding the rule by which the source code content is paginated and separating the source code content accordingly;
(3) obtaining, through the HTTP parser, data in a certain format from the tunnel protocol, the certain format being key-value pair data, packed array-structure data, and structured data with characteristic characters; extracting the data as the required target data result, then screening the result by format and removing invalid characters; starting multiple threads that issue HTTP requests, each URL request asking for only a part of the resource file, and merging the files downloaded by the threads;
(4) thread processing: assigning a new thread to each URL request and processing the requests concurrently in multi-threaded mode; multi-threaded calls are carried out by a thread manager, and a thread that fails is suspended automatically; each request needs its own thread to complete; in a thread pool the number of threads is fixed and the total number of threads does not exceed the number of threads the pool can hold; when the server handles these requests without a thread pool, the total number of request threads is no more than 50000;
(5) class manager: managing, through the class manager, the rules from each major commodity category to its groups; once the data matches a rule, the source code of the major commodity category is obtained first, and the source code files of the subcategory commodities are then fetched by recursive calls;
(6) setting a thread timeout: if it cannot be detected within the timeout interval that a thread has executed successfully, the thread's processing flag is set to failed, and the thread is automatically restarted when the system is idle; based on this flag, the thread pool either processes the task directly or increases the number of workers to process it, and places it in the pending queue; other thread pools can put tasks directly into the pending queue, where they wait for a worker thread to take them out and execute them;
(7) saving all extracted commodity data to the database in the said certain data format;
wherein the HTTP parsing comprises the following steps: (1) parsing the HTML (hypertext markup language) source file, including the js script files and css files in the HTML file, and processing them; (2) processing the result with the source file parser, parsing each format accordingly; (3) obtaining the hidden content returned by js scripts through the HTTP packet handler; (4) matching and processing the hidden content; (5) integrating the final data; (6) having the thread manager process new tasks concurrently.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610272886.4A CN105868412B (en) | 2016-04-28 | 2016-04-28 | A kind of multi-thread data grasping means based on B2B platform |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610272886.4A CN105868412B (en) | 2016-04-28 | 2016-04-28 | A kind of multi-thread data grasping means based on B2B platform |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105868412A CN105868412A (en) | 2016-08-17 |
CN105868412B true CN105868412B (en) | 2019-05-03 |
Family
ID=56629472
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610272886.4A Active CN105868412B (en) | 2016-04-28 | 2016-04-28 | A kind of multi-thread data grasping means based on B2B platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105868412B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107577525B (en) * | 2017-08-22 | 2020-11-17 | 努比亚技术有限公司 | Method and device for creating concurrent threads and computer-readable storage medium |
CN109101440B (en) * | 2018-08-01 | 2023-08-04 | 浪潮软件集团有限公司 | Method for processing trace data concurrent request based on JVM (Java virtual machine) cache |
CN109408695A (en) * | 2018-09-27 | 2019-03-01 | 苏州创旅天下信息技术有限公司 | Competing product data grab method and system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102968495A (en) * | 2012-11-29 | 2013-03-13 | 河海大学 | Vertical search engine and method for searching contrast association shopping information |
KR101374533B1 (en) * | 2013-04-17 | 2014-03-14 | 주식회사 엔써티 | High performance replication system and backup system for mass storage data, method of the same |
CN104050037A (en) * | 2014-06-13 | 2014-09-17 | 淮阴工学院 | Implementation method for directional crawler based on assigned e-commerce website |
CN104376063A (en) * | 2014-11-11 | 2015-02-25 | 南京邮电大学 | Multithreading web crawler method based on sort management and real-time information updating system |
CN104583949A (en) * | 2012-08-16 | 2015-04-29 | 高通股份有限公司 | Pre-processing of scripts in web browsers |
Also Published As
Publication number | Publication date |
---|---|
CN105868412A (en) | 2016-08-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||