CN102937989A - Parallel distributed internet data capture method and system - Google Patents

Parallel distributed internet data capture method and system

Info

Publication number
CN102937989A
CN102937989A CN2012104225715A CN201210422571A CN102937989A CN 102937989 A CN102937989 A CN 102937989A CN 2012104225715 A CN2012104225715 A CN 2012104225715A CN 201210422571 A CN201210422571 A CN 201210422571A CN 102937989 A CN102937989 A CN 102937989A
Authority
CN
China
Prior art keywords
crawl
data
text
page
comment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012104225715A
Other languages
Chinese (zh)
Other versions
CN102937989B (en)
Inventor
杨睿尘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Tengyi Science & Technology Development Co Ltd
Original Assignee
Beijing Tengyi Science & Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Tengyi Science & Technology Development Co Ltd
Priority to CN201210422571.5A (CN102937989B)
Publication of CN102937989A
Application granted
Publication of CN102937989B
Legal status: Expired - Fee Related
Anticipated expiration legal-status Critical

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a parallel distributed internet data capture method and system. The method comprises: setting crawl configuration information for a target site; according to the crawl configuration information, capturing one by one the links to the articles listed on a section index page of the target site, and following each link to crawl the article's pagination information and content; and determining whether the article has comment data and, if so, following the link to the comment page to crawl the comment pagination information and comment content. The method and system offer high quality and high efficiency.

Description

Parallel distributed internet data capture method and system
Technical field
The present invention relates to the fields of computer application technology and information technology, and in particular to a parallel distributed internet data capture method and system.
Background
The internet is developing rapidly, and the number of internet users in China is growing explosively. The internet is gradually replacing traditional media (newspapers, books, radio, television, and so on) as the main channel through which people obtain and publish information. Because the internet is free and open, simple to use, fast at spreading information, and has a very large user base, information on it can spread and have an impact quickly. As the internet's role grows in importance, research on internet information is flourishing. Such research first requires capturing massive amounts of web page information in many different formats and converting it to a unified format for later analysis; second, it requires a high-quality, high-efficiency capture technique. The parallel distributed internet data capture system was developed to meet this need.
Summary of the invention
The present invention aims to solve at least one of the above technical problems, at least to some extent, or at least to provide a useful commercial alternative. To this end, one object of the present invention is to propose a high-quality, high-efficiency parallel distributed internet data capture method and system.
One aspect of the present invention proposes a parallel distributed internet data capture method, comprising: setting crawl configuration information for a target website; according to the crawl configuration information, starting from a section index page of the target website, capturing one by one the links to the articles that appear on the section index page, and following each article link to crawl the article's pagination information and body content; and determining whether the article has comment data and, if so, further following the link to the comment page to crawl the comment pagination information and comment content.
In one embodiment of the method of the present invention, the method further comprises: when an exception occurs during crawling, recording log information and retrying the crawl until it succeeds.
In one embodiment of the method of the present invention, the data crawling is carried out in a parallel, distributed manner.
Another aspect of the present invention proposes a parallel distributed internet data capture system, comprising: a configuration module, used by the user to set crawl configuration information for a target website; an article capture module, which, according to the crawl configuration information and starting from a section index page of the target website, captures one by one the links to the articles that appear on the section index page and follows each article link to crawl the article's pagination information and body content; a determination module, used to determine whether the article has comment data; and a comment capture module, used, when the article has comment data, to further follow the link to the comment page and crawl the comment pagination information and comment content.
In one embodiment of the system of the present invention, the system further comprises a logging module, used to record log information when an exception occurs during crawling, whereupon the parallel distributed internet data capture system retries the crawl until it succeeds.
In one embodiment of the system of the present invention, the data capture module has a parallel, distributed structure.
In summary, first, the present invention proposes a parallel distributed internet data capture method and system in which the target sites to be crawled can be extended freely through configuration, and which adopts a parallel and distributed design so that crawl efficiency and timeliness are ensured. Second, the present invention adopts a flexible deduplication and incremental crawling mechanism: without any secondary database query, simply by monitoring the files under the local page storage path, it achieves deduplication and incremental crawling of captured pages, ensures the uniqueness of the captured data, and saves a large amount of software and hardware resources. Third, the present invention supports unified crawling of both dynamic and static web pages. The method and system of the present invention therefore have the advantages of high quality and high efficiency.
Additional aspects and advantages of the present invention will be given in part in the following description, and in part will become apparent from the description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of the parallel distributed internet data capture method according to an embodiment of the present invention;
Fig. 2 is a structural block diagram of the parallel distributed internet data capture system according to an embodiment of the present invention;
Fig. 3 is a detailed flowchart of the parallel distributed internet data capture method under normal conditions according to an embodiment of the present invention;
Fig. 4 is a detailed flowchart of the parallel distributed internet data capture method under abnormal conditions according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the layout of the parallel distributed data capture module according to an embodiment of the present invention; and
Fig. 6 is a schematic diagram of the unified crawling of dynamic and static web pages according to an embodiment of the present invention.
Detailed description
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and should not be construed as limiting it.
In the description of the present invention, it should be understood that terms such as "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", and "counterclockwise" indicate orientations or positional relationships based on those shown in the drawings. They are used only to facilitate and simplify the description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore should not be construed as limiting the present invention.
In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality of" means two or more, unless specifically defined otherwise.
In the present invention, unless otherwise expressly specified and limited, terms such as "mounted", "connected", "coupled", and "fixed" should be understood broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediary; and it may be an internal connection between two elements. A person of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the particular circumstances.
In the present invention, unless otherwise expressly specified and limited, a first feature being "on" or "under" a second feature may include the first and second features being in direct contact, and may also include the first and second features not being in direct contact but being in contact through another feature between them. Moreover, a first feature being "on", "above", or "over" a second feature includes the first feature being directly above or obliquely above the second feature, or merely indicates that the first feature is at a higher level than the second feature. A first feature being "under", "below", or "beneath" a second feature includes the first feature being directly below or obliquely below the second feature, or merely indicates that the first feature is at a lower level than the second feature.
The invention belongs to the fields of computer application technology and information technology, and mainly concerns the implementation of a focused web crawler based on depth-first crawling. A web crawler is the foundation and prerequisite for analysis of internet information: every analysis operation is carried out on the massive internet data that the crawler captures.
The main purpose of the present invention is to solve three problems: efficient and accurate capture of massive internet data, flexible deduplication and incremental crawling of the captured data, and unified crawling of dynamic and static web pages. Internet data analysis places a notably high demand on the timeliness of data, and the volume of internet data is enormous, so ensuring comprehensive coverage requires a technique that can capture massive internet data efficiently and accurately. The parallel distributed internet data capture system was developed for this purpose. To support its parallel and distributed character, a flexible and efficient deduplication and incremental crawling mechanism was implemented, which ensures the uniqueness of the captured data and high utilization of software and hardware resources. Web pages on the internet today fall into two classes, dynamic and static, so the capture system must support both. The same capture method and flow are used for dynamic and static pages, which reduces the complexity of the program and the difficulty of later maintenance.
One aspect of the present invention proposes a parallel distributed internet data capture method which, as shown in Fig. 1, comprises: S1. setting crawl configuration information for a target website; S2. according to the crawl configuration information, starting from a section index page of the target website, capturing one by one the links to the articles that appear on the section index page, and following each article link to crawl the article's pagination information and body content; and S3. determining whether the article has comment data and, if so, further following the link to the comment page to crawl the comment pagination information and comment content.
In one embodiment of the method of the present invention, the method further comprises: when an exception occurs during crawling, recording log information and retrying the crawl until it succeeds.
In one embodiment of the method of the present invention, the data crawling is carried out in a parallel, distributed manner.
Another aspect of the present invention proposes a parallel distributed internet data capture system, comprising: a configuration module 100, used by the user to set crawl configuration information for a target website; an article capture module 200, which, according to the crawl configuration information and starting from a section index page of the target website, captures one by one the links to the articles that appear on the section index page and follows each article link to crawl the article's pagination information and body content; a determination module 300, used to determine whether the article has comment data; and a comment capture module 400, used, when the article has comment data, to further follow the link to the comment page and crawl the comment pagination information and comment content.
In one embodiment of the system of the present invention, the system further comprises a logging module, used to record log information when an exception occurs during crawling, whereupon the parallel distributed internet data capture system retries the crawl until it succeeds.
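The log-then-retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function name `fetch_with_retry`, the `delay` pause, and the optional `max_attempts` cap are all assumptions added here. With `max_attempts=None` the loop retries until the fetch succeeds, which matches the wording of the description.

```python
import logging
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
log = logging.getLogger("crawler")  # hypothetical logger name


def fetch_with_retry(fetch: Callable[[], T],
                     delay: float = 1.0,
                     max_attempts: Optional[int] = None) -> T:
    """Call fetch() until it succeeds, logging every failure.

    max_attempts=None retries indefinitely; a finite cap is an
    assumption added here so that callers can bound the loop.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return fetch()
        except Exception as exc:
            # Record the exception before retrying.
            log.warning("crawl attempt %d failed: %s", attempt, exc)
            if max_attempts is not None and attempt >= max_attempts:
                raise
            time.sleep(delay)
```

In practice a bounded retry count and a growing delay are common choices, since an unbounded loop can stall a crawl thread on a page that no longer exists.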
In one embodiment of the system of the present invention, the data capture module has a parallel, distributed structure.
To help those skilled in the art better understand the technical solution of the present invention, it is further described below with reference to Figs. 3 to 6.
The present invention mainly addresses three problems: how to capture massive internet data efficiently and accurately, how to deduplicate the captured data flexibly and crawl incrementally, and how to crawl dynamic and static web pages in a unified way. This part first introduces the overall implementation of the system; then, on that basis, it describes the specific implementation of each part and the related procedures.
1. Overall implementation of the internet data capture system
The overall design and implementation of parallel distributed internet data crawling can be summarized as follows. The system takes as input crawl configuration information that has been manually prepared in advance for a section of the target website. Once started, the capture system begins at the section index page of the website, captures one by one all the article links that appear on it, and follows each article link to capture the article's pagination information and body content. If an article has comment data, the system further captures the comment content on the comment pages (including paginated comments).
The operational flow of the data capture system is shown in Fig. 3, and the exception handling flow is shown in Fig. 4.
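The overall flow just described (index page, then each article with its pagination, then comments where present) can be sketched as below. This is a simplified illustration under stated assumptions: the `Page` structure, the `fetch` callable, and the function names are hypothetical, and HTML parsing is assumed to have happened inside `fetch`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Page:
    """One fetched and already parsed page (hypothetical structure)."""
    content: str = ""
    article_links: List[str] = field(default_factory=list)  # index pages only
    next_page: Optional[str] = None     # pagination link, if any
    comment_link: Optional[str] = None  # set when the article has comments


Fetch = Callable[[str], Page]


def crawl_paginated(first_url: str, fetch: Fetch) -> List[Page]:
    """Follow next_page links from first_url and return every page visited."""
    pages, url, seen = [], first_url, set()
    while url is not None and url not in seen:  # guard against link cycles
        seen.add(url)
        page = fetch(url)
        pages.append(page)
        url = page.next_page
    return pages


def crawl_section(index_url: str, fetch: Fetch) -> Dict[str, dict]:
    """Depth-first crawl: index page -> each article -> its comment pages."""
    results = {}
    for link in fetch(index_url).article_links:
        body_pages = crawl_paginated(link, fetch)
        comment_link = body_pages[0].comment_link
        comment_pages = crawl_paginated(comment_link, fetch) if comment_link else []
        results[link] = {
            "body": [p.content for p in body_pages],
            "comments": [p.content for p in comment_pages],
        }
    return results
```

Passing `fetch` in as a parameter keeps the flow independent of how pages are downloaded and parsed, so the same loop can be exercised against an in-memory dictionary of pages.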
2. Specific implementation of the internet data capture system
This part describes the specific implementation of each of the three aspects: efficient and accurate capture of massive internet data, flexible deduplication and incremental crawling of the captured data, and unified crawling of dynamic and static web pages.
1) Efficient and accurate capture of massive internet data
To capture massive internet data efficiently and accurately, the present invention approaches the problem from two directions. The first is parallelization: multiple crawl instances are started on the same crawler server and perform network data crawling in parallel. The second is distribution: crawlers are deployed on multiple servers at the same time, and the crawler on each server can work independently. A schematic diagram of the parallel distributed internet data capture system is shown in Fig. 5. The whole system is organized around a central database. Multiple crawler servers are deployed around this database, and each crawler server runs multiple crawl threads simultaneously. This program structure and implementation ensure efficient, real-time capture of massive internet data. However, it also introduces the problems of duplicate captured data and of how to crawl incrementally, which the next part addresses.
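A rough sketch of the two levels described above, several crawl threads on one server and several servers working independently, might look like this. The text does not say how sections are divided among servers, so the round-robin rule in `shard_for_server` is purely an assumption for illustration, as are the function names; the central database is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def shard_for_server(sections: Iterable[str],
                     server_index: int,
                     server_count: int) -> List[str]:
    """Assumed assignment rule: round-robin over the sorted section list.

    Every server computes the same split, so no coordination is needed
    for this illustrative scheme.
    """
    return [s for i, s in enumerate(sorted(sections))
            if i % server_count == server_index]


def crawl_in_parallel(sections: List[str],
                      crawl_one: Callable[[str], object],
                      workers: int = 4) -> Dict[str, object]:
    """Run crawl_one for each section on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(crawl_one, sections))  # keeps input order
    return dict(zip(sections, results))
```

Threads suit this workload because crawling is dominated by network waits; each server would call `shard_for_server` with its own index and then hand its share to `crawl_in_parallel`.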
2) Flexible deduplication and incremental crawling of captured data
An important measure of whether an internet data capture system is well designed is whether it supports deduplication and incremental crawling. On the one hand, because the crawler runs in a loop, it will inevitably fetch web page information it has already fetched. On the other hand, internet information changes in real time: between two crawls of the same website section at different times, the information under the section may have changed by the later crawl (comments in particular tend to accumulate). Because earlier crawls have already saved a large amount of data, the next crawl should not fetch the saved data again. Repeated crawling produces redundant data, which makes later analysis results inaccurate, and it seriously wastes system resources, network bandwidth, and other resources. To implement deduplication and incremental crawling, the system adopts a simple scheme: it checks whether the relevant web page file already exists under the directory where captured pages are saved, and from that decides whether the current page would be a repeat crawl. Specifically, the directory structure used when saving captured data encodes information about the web page, with each directory level representing, respectively, the page's website, section, subsection, title, and so on. This structure ensures that the same web page always maps to the same save location. Since a directory cannot hold two files with the same name, this achieves deduplication. The design has two advantages. First, it does not require frequent interaction with an external data store; the needed information is obtained simply by probing for the result file, which is more efficient. Second, it ensures that the crawler module has only one entry and one exit, which reduces coupling with other modules.
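The directory-based check described above can be illustrated with a short sketch. The exact directory levels, the file extension, and the name-sanitizing rule are assumptions made for this example; only the idea (derive the save path from the page's site, section, subsection, and title, then test whether the file already exists) comes from the text.

```python
import re
from pathlib import Path


def safe(name: str) -> str:
    """Replace characters that are awkward in file names (a simplification)."""
    return re.sub(r'[\\/:*?"<>|\s]+', "_", name).strip("_") or "untitled"


def page_path(root: Path, site: str, section: str,
              subsection: str, title: str) -> Path:
    """Map a page to one fixed location: root/site/section/subsection/title."""
    return (root / safe(site) / safe(section) / safe(subsection)
            / (safe(title) + ".html"))


def save_if_new(path: Path, content: str) -> bool:
    """Write content only if no file exists at path; return True if written."""
    path.parent.mkdir(parents=True, exist_ok=True)
    try:
        # Mode "x" creates the file and fails if it already exists, so the
        # existence check and the write happen in a single step.
        with open(path, "x", encoding="utf-8") as fh:
            fh.write(content)
    except FileExistsError:
        return False  # already captured: skip as a duplicate
    return True
```

Using exclusive-create mode rather than a separate existence test avoids a race when several crawl threads reach the same page at once. A crawler would normally probe the path before downloading, so that the duplicate request is never sent at all.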
3) Unified crawling of dynamic and static web pages
Page crawling can be divided into static page crawling and dynamic page crawling. Crawling a static page is simple: request the static URL over HTTP and save the response content, which contains all the body text and comment content shown on the page. A dynamic page cannot be handled so simply. Because it is generated dynamically, the page source returned after a request is sent to the server does not contain the body text or comment content shown on the page, only some JavaScript commands. By using the Fiddler tool to analyze the HTTP responses and activity that occur when the page is opened, one can see that the content of a dynamic page can also be obtained in some form. The dynamic content has simply been converted to another format and stored under a hidden URL; if the hidden URL can be accessed, the dynamic content can be obtained. Fetching static and dynamic pages therefore reduces to the same procedure: find the URL that actually contains the required content, request it over HTTP, and save the response content. A schematic diagram of the unified crawl process for static and dynamic pages is shown in Fig. 6. The key to unified crawling of dynamic and static pages is finding the target URI. Once the target URI is found, the processing flow is the same: send an HTTP request to the target URI through the DownloadPage function and receive the corresponding HTTP response, which contains the content of the page. This procedure applies equally to article pages and comment pages. As for why crawling dynamic pages differs from crawling static pages, Fig. 6 shows that the target URI of a static page is easy to find (in other words, it is formed by the program from simple URI patterns in the configuration), whereas the target URI of a dynamic page is usually hidden and can only be obtained through analysis with an external tool. Once the external tool has been used to work out the target URI of the dynamic page, or the pattern and rule for it, the crawl processes for dynamic and static pages merge into a single flow. The parts of the figure above where dynamic and static crawling differ all fall within the crawl configuration phase, so the capture program itself only implements the processing that follows once the target URI or target URI rule is known.
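The unified flow described above can be sketched as follows: the only difference between a static and a dynamic page is the URI pattern stored in the configuration, and everything after the target URI is known is shared. The rule format, the example URLs, and the function names are hypothetical; `download_page` merely stands in for the DownloadPage function named in the text, whose actual signature is not given.

```python
from typing import Callable, Dict
from urllib.request import urlopen

# Hypothetical configuration entries. The dynamic pattern is the kind of
# hidden URL that would be found by inspecting traffic with an external tool.
STATIC_RULE = {"pattern": "http://example.com/news/{article_id}.html"}
DYNAMIC_RULE = {
    "pattern": "http://example.com/api/comments?article={article_id}&page={page}"
}


def resolve_target_uri(rule: Dict[str, str], params: Dict[str, str]) -> str:
    """Fill in the configured URI pattern to obtain the target URI."""
    return rule["pattern"].format(**params)


def download_page(uri: str, timeout: float = 10.0) -> bytes:
    """Send a plain HTTP GET to the target URI and return the response body."""
    with urlopen(uri, timeout=timeout) as response:
        return response.read()


def crawl(rule: Dict[str, str], params: Dict[str, str],
          download: Callable[[str], bytes] = download_page) -> bytes:
    """Single flow for both page types: resolve the target URI, then download."""
    return download(resolve_target_uri(rule, params))
```

Keeping the static/dynamic distinction entirely in configuration means that adding a new site, or adapting to a changed hidden URL, requires editing a rule rather than the capture code.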
In summary, first, the present invention proposes a parallel distributed internet data capture method and system in which the target sites to be crawled can be extended freely through configuration, and which adopts a parallel and distributed design so that crawl efficiency and timeliness are ensured. Second, the present invention adopts a flexible deduplication and incremental crawling mechanism: without any secondary database query, simply by monitoring the files under the local page storage path, it achieves deduplication and incremental crawling of captured pages, ensures the uniqueness of the captured data, and saves a large amount of software and hardware resources. Third, the present invention supports unified crawling of both dynamic and static web pages. The method and system of the present invention therefore have the advantages of high quality and high efficiency.
It should be noted that any process or method described in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order, depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention pertain.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it should be understood that these embodiments are exemplary and should not be construed as limiting the present invention. Those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention without departing from its principles and purpose.

Claims (6)

1. A parallel distributed internet data capture method, characterized in that it comprises the steps of:
setting crawl configuration information for a target website;
according to the crawl configuration information, starting from a section index page of the target website, capturing one by one the links to the articles that appear on the section index page, and following each article link to crawl the article's pagination information and body content; and
determining whether the article has comment data and, if so, further following the link to the comment page to crawl the comment pagination information and comment content.
2. The parallel distributed internet data capture method as claimed in claim 1, characterized in that it further comprises: when an exception occurs during crawling, recording log information and retrying the crawl until it succeeds.
3. The parallel distributed internet data capture method as claimed in claims 1 and 2, characterized in that the data crawling is carried out in a parallel, distributed manner.
4. A parallel distributed internet data capture system, characterized in that it comprises:
a configuration module, used by the user to set crawl configuration information for a target website;
an article capture module, which, according to the crawl configuration information and starting from a section index page of the target website, captures one by one the links to the articles that appear on the section index page, and follows each article link to crawl the article's pagination information and body content;
a determination module, used to determine whether the article has comment data; and
a comment capture module, used, when the article has comment data, to further follow the link to the comment page and crawl the comment pagination information and comment content.
5. The parallel distributed internet data capture system as claimed in claim 4, characterized in that it further comprises: a logging module, used to record log information when an exception occurs during crawling, whereupon the parallel distributed internet data capture system retries the crawl until it succeeds.
6. The parallel distributed internet data capture system as claimed in claims 4 and 5, characterized in that the data capture module has a parallel, distributed structure.
CN201210422571.5A 2012-10-29 2012-10-29 Parallelization distributed interconnection data grab method and system thereof Expired - Fee Related CN102937989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210422571.5A CN102937989B (en) 2012-10-29 2012-10-29 Parallelization distributed interconnection data grab method and system thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210422571.5A CN102937989B (en) 2012-10-29 2012-10-29 Parallelization distributed interconnection data grab method and system thereof

Publications (2)

Publication Number Publication Date
CN102937989A true CN102937989A (en) 2013-02-20
CN102937989B CN102937989B (en) 2016-06-22

Family

ID=47696886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210422571.5A Expired - Fee Related CN102937989B (en) 2012-10-29 2012-10-29 Parallelization distributed interconnection data grab method and system thereof

Country Status (1)

Country Link
CN (1) CN102937989B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103258017A (en) * 2013-04-24 2013-08-21 中国科学院计算技术研究所 Method and system for parallel square crossing network data collection
CN104391917A (en) * 2014-11-19 2015-03-04 四川长虹电器股份有限公司 Method for incrementally capturing webpage contents
CN104965888A (en) * 2015-06-16 2015-10-07 武汉华工赛百数据系统有限公司 Data acquiring method and system
CN105447184A (en) * 2015-12-15 2016-03-30 北京百分点信息科技有限公司 Information capturing method and device
CN106126747A (en) * 2016-07-14 2016-11-16 北京邮电大学 Data capture method based on reptile and device
CN107506425A (en) * 2017-08-18 2017-12-22 广东电网有限责任公司信息中心 A kind of web page files gather archiving method
CN107590236A (en) * 2017-09-09 2018-01-16 杭州数立方征信有限公司 A kind of big data acquisition method and system towards enterprise in charge of construction
CN108121751A (en) * 2016-11-30 2018-06-05 北京国双科技有限公司 The method and apparatus of web page crawl
CN108932299A (en) * 2018-06-07 2018-12-04 北京迈格威科技有限公司 The method and device being updated for the model to inline system
CN109213824A (en) * 2017-06-29 2019-01-15 北京京东尚科信息技术有限公司 Data grabber system, method and apparatus
CN111651656A (en) * 2020-06-02 2020-09-11 重庆邮电大学 Method and system for dynamic webpage crawler based on agent mode

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070130125A1 (en) * 2005-12-05 2007-06-07 Bmenu As System, process and software arrangement for assisting in navigating the internet
CN101291304A (en) * 2008-06-13 2008-10-22 清华大学 Transplantable network information sharing method
CN101404666A (en) * 2008-10-06 2009-04-08 赵洪宇 Infinite layer collection method based on Web page
CN102609412A (en) * 2011-01-07 2012-07-25 华东师范大学 RSS (Really Simple Syndication)-based multi-thread graphic information synchronization crawling control method and system

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103258017B (en) * 2013-04-24 2016-04-13 中国科学院计算技术研究所 A kind of parallel square crossing network data acquisition method and system
CN103258017A (en) * 2013-04-24 2013-08-21 中国科学院计算技术研究所 Method and system for parallel square crossing network data collection
CN104391917A (en) * 2014-11-19 2015-03-04 四川长虹电器股份有限公司 Method for incrementally capturing webpage contents
CN104965888A (en) * 2015-06-16 2015-10-07 武汉华工赛百数据系统有限公司 Data acquiring method and system
CN105447184A (en) * 2015-12-15 2016-03-30 北京百分点信息科技有限公司 Information capturing method and device
CN105447184B (en) * 2015-12-15 2019-06-11 北京百分点信息科技有限公司 Information extraction method and device
CN106126747A (en) * 2016-07-14 2016-11-16 北京邮电大学 Data capture method based on reptile and device
CN108121751A (en) * 2016-11-30 2018-06-05 北京国双科技有限公司 The method and apparatus of web page crawl
CN109213824A (en) * 2017-06-29 2019-01-15 北京京东尚科信息技术有限公司 Data grabber system, method and apparatus
CN107506425A (en) * 2017-08-18 2017-12-22 广东电网有限责任公司信息中心 A kind of web page files gather archiving method
CN107590236A (en) * 2017-09-09 2018-01-16 杭州数立方征信有限公司 A kind of big data acquisition method and system towards enterprise in charge of construction
CN107590236B (en) * 2017-09-09 2020-08-28 数立方(杭州)信息科技有限公司 Big data acquisition method and system for building construction enterprises
CN108932299A (en) * 2018-06-07 2018-12-04 北京迈格威科技有限公司 The method and device being updated for the model to inline system
CN111651656A (en) * 2020-06-02 2020-09-11 重庆邮电大学 Method and system for dynamic webpage crawler based on agent mode

Also Published As

Publication number Publication date
CN102937989B (en) 2016-06-22

Similar Documents

Publication Publication Date Title
CN102937989A (en) Parallel distributed internet data capture method and system
US9767082B2 (en) Method and system of retrieving ajax web page content
CN109299073B (en) Data lineage generation method and system, electronic equipment and storage medium
Do et al. Three-dimensional numerical simulation of mechanized twin stacked tunnels in soft ground
CN106250454A (en) The loading method of a kind of page script and device
CN103778236A (en) Webpage data distribution processing method and device as well as webpage generation processing method and device
CN102262635A (en) Page crawler system and page crawler method
CN109063144A (en) Visual network crawler method and device
CN104750866B (en) Method and system based on flush type WEB server dynamic generation menu bar
CN103701906A (en) Distributed real-time calculation system and data processing method thereof
CN111708962A (en) Rendering method, device and equipment of skeleton screen and storage medium
CN102004805B (en) Webpage denoising system and method based on maximum similarity matching
CN105574027A (en) On-line transaction processing/on-line analytical processing (OLTP/OLAP) hybrid application based multi-dimensional performance data storage method, device and system
CN112860844A (en) Case clue processing system, method and device and computer equipment
CN201594261U (en) Site information resolution system
Xu et al. Simultaneous shape and topology optimization of truss under local and global stability constraints
US9021349B1 (en) System, method, and computer program product for identifying differences in a EDA design
CN105739997A (en) Website construction assistance design system
Yuk et al. Comparison of extraction methods for bug tracking system analysis
Regenauer-Lieb et al. Understanding earthquake precursors: from subcritical instabilities to catastrophic events
Siddiqui An investigation into data sharing between building information modelling and fire safety engineering, with potential applications to smart buildings
CN106528711B (en) Intersection solving method and system for data of out-of-table files
Slone et al. A finite volume unstructured mesh approach to dynamic fluid-structure interaction: an assessment of the challenge of flutter analysis
Schoenstein Computer Science on Campus: Technology,(Inter) disciplinarity, and the Transformation of the American University
Khorasani et al. Comparison between the behavior of RCS and Concrete Moment Frames under Near Fault Earthquakes

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160622

Termination date: 20171029
