CN102937989A - Parallel distributed internet data capture method and system - Google Patents
- Publication number
- CN102937989A
- Authority
- CN
- China
- Prior art keywords
- crawl
- data
- text
- page
- comment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention provides a parallel distributed internet data capture method and system. The method comprises the following steps of: setting capture configuration information of a destination site; capturing links of texts on a webpage index page one by one from the webpage index page of the destination site according to the capture configuration information, and deepening the links of the texts to crawl paging information and contents of the texts; and judging whether the texts contain review data, and if so, further deepening links of review pages to crawl review paging information and review contents. The parallel distributed internet data capture method and system have the advantages of high quality and high efficiency.
Description
Technical field
The present invention relates to the fields of computer application technology and information technology, and in particular to a parallel distributed internet data capture method and system.
Background
The internet is developing rapidly, and the number of internet users in China is growing explosively. The internet is gradually replacing traditional media (newspapers, books, radio, television, and so on) as the main channel through which people obtain and publish information. Because the internet is free and open, simple to use, fast to propagate, and has a very large user base, information on it spreads and takes effect quickly. As the role of the internet grows, research on internet information is flourishing as well. Such research first requires capturing massive amounts of web page information in differing formats and converting it into a unified format to simplify later analysis; second, it requires capture technology that is both high in quality and high in efficiency. The parallel distributed internet data capture system was developed to meet this need.
Summary of the invention
The present invention aims to solve at least one of the above technical problems at least to some extent, or at least to provide a useful commercial alternative. To this end, one object of the invention is to provide a parallel distributed internet data capture method and system offering high quality and high efficiency.
One aspect of the invention provides a parallel distributed internet data capture method, comprising: setting capture configuration information for a target website; according to the capture configuration information, starting from a section index page of the target website, capturing one by one the links to the articles that appear on the section index page, and following each article link to crawl the article's pagination information and body content; and determining whether the article has comment data and, if so, further following the link to its comment page to crawl the comment pagination information and comment content.
In one embodiment of the method, when an exception occurs during capture, log information is recorded and the capture is retried until it succeeds.
In one embodiment of the method, the data capture is carried out in a parallel and distributed manner.
Another aspect of the invention provides a parallel distributed internet data capture system, comprising: a configuration module, with which a user sets the capture configuration information for a target website; an article capture module, which, according to the capture configuration information and starting from a section index page of the target website, captures one by one the links to the articles that appear on the section index page and follows each article link to crawl the article's pagination information and body content; a judgment module, which determines whether the article has comment data; and a comment capture module, which, when the article has comment data, further follows the link to its comment page to crawl the comment pagination information and comment content.
In one embodiment of the system, the system further comprises a logging module, which records log information when an exception occurs during capture; the parallel distributed internet data capture system then retries the capture until it succeeds.
In one embodiment of the system, the data capture modules have a parallel distributed structure.
In summary, the invention first provides a parallel distributed internet data capture method and system in which the target sites to be captured can be extended freely through configuration, and whose parallel and distributed design safeguards the efficiency and timeliness of data capture. Second, the invention adopts a flexible deduplication and incremental capture mechanism: without any secondary database query, simply by checking the files under the local page storage path, it deduplicates captured web pages and captures incrementally, which guarantees the uniqueness of the captured data and saves a large amount of software and hardware resources. Third, the invention supports unified capture of both dynamic and static web pages. The method and system of the invention therefore have the advantages of high quality and high efficiency.
Additional aspects and advantages of the invention are given in part in the description that follows; in part they will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of embodiments taken together with the accompanying drawings, in which:
Fig. 1 is a flowchart of the parallel distributed internet data capture method according to an embodiment of the invention;
Fig. 2 is a structural block diagram of the parallel distributed internet data capture system according to an embodiment of the invention;
Fig. 3 is a detailed flowchart of the parallel distributed internet data capture method under normal conditions according to an embodiment of the invention;
Fig. 4 is a detailed flowchart of the parallel distributed internet data capture method under exception conditions according to an embodiment of the invention;
Fig. 5 is a schematic diagram of the layout of the parallel distributed data capture modules according to an embodiment of the invention; and
Fig. 6 is a schematic diagram of the unified capture of dynamic and static web pages according to an embodiment of the invention.
Detailed description of the embodiments
Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the invention and are not to be construed as limiting it.
In the description of the invention, it should be understood that orientation or positional terms such as "center", "longitudinal", "transverse", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", and "counterclockwise" refer to the orientations or positional relationships shown in the drawings. They are used only to simplify the description of the invention and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they are therefore not to be construed as limiting the invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features referred to. A feature qualified by "first" or "second" may thus explicitly or implicitly include one or more such features. In the description of the invention, "a plurality of" means two or more, unless specifically defined otherwise.
In the invention, unless expressly specified and limited otherwise, the terms "mounted", "joined", "connected", and "fixed" are to be understood broadly: a connection may, for example, be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediary; and it may be an internal connection between two elements. A person of ordinary skill in the art can understand the specific meaning of these terms in the invention according to the circumstances.
In the invention, unless expressly specified and limited otherwise, a first feature being "on" or "under" a second feature may mean that the two features are in direct contact, or that they are not in direct contact but are in contact through another feature between them. Moreover, a first feature being "on", "above", or "over" a second feature includes the first feature being directly above or obliquely above the second feature, or merely indicates that the first feature is at a greater height than the second feature. A first feature being "under", "below", or "beneath" a second feature includes the first feature being directly below or obliquely below the second feature, or merely indicates that the first feature is at a lesser height than the second feature.
The invention belongs to the fields of computer application technology and information technology and mainly concerns the implementation of a targeted web crawler based on depth-first crawling. A web crawler is the foundation and prerequisite of internet information analysis: every analysis operation is performed on the massive internet data that the crawler captures.
The main purpose of the invention is to solve three problems: efficient and accurate capture of massive internet data, flexible deduplication and incremental capture of the captured data, and unified capture of dynamic and static web pages. Internet-based data analysis places a strong requirement on the timeliness of data, and the volume of internet data is enormous, so a technique that can capture massive internet data efficiently and accurately is needed to keep the data comprehensive. The parallel distributed internet data capture system was developed for this purpose. To support its parallel and distributed operation, a flexible and efficient deduplication and incremental capture mechanism is implemented, which ensures that captured data is unique and that software and hardware resources are used efficiently. Web pages on the internet today fall into two classes, dynamic and static, so the capture system must support both. The same capture method and flow is used for dynamic and static pages, which reduces program complexity and the difficulty of later maintenance.
One aspect of the invention provides a parallel distributed internet data capture method which, as shown in Fig. 1, comprises: S1, setting capture configuration information for a target website; S2, according to the capture configuration information, starting from a section index page of the target website, capturing one by one the links to the articles that appear on the section index page, and following each article link to crawl the article's pagination information and body content; and S3, determining whether the article has comment data and, if so, further following the link to its comment page to crawl the comment pagination information and comment content.
In one embodiment of the method, when an exception occurs during capture, log information is recorded and the capture is retried until it succeeds.
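The log-and-retry behaviour in this embodiment can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: the function name `fetch_with_retry`, the fixed delay between attempts, and the optional `max_attempts` cap are all assumptions added for the example. With the default `max_attempts=None` the loop retries until the capture succeeds, as the text describes.

```python
import logging
import time

logger = logging.getLogger("crawler")  # hypothetical logger name


def fetch_with_retry(fetch, url, delay_seconds=1.0, max_attempts=None, sleep=time.sleep):
    """Call fetch(url); on any exception, log it and try again.

    max_attempts=None retries until success (the behaviour described in the text).
    A finite max_attempts is an added safety valve and an assumption of this sketch.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return fetch(url)
        except Exception as exc:
            # Record the failure so the exception flow can be inspected later.
            logger.warning("attempt %d for %s failed: %s", attempt, url, exc)
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(delay_seconds)
```

In practice an unbounded retry loop can stall on a permanently missing page, which is why the sketch leaves room for an attempt limit.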
In one embodiment of the method, the data capture is carried out in a parallel and distributed manner.
Another aspect of the invention provides a parallel distributed internet data capture system, comprising: a configuration module 100, with which a user sets the capture configuration information for a target website; an article capture module 200, which, according to the capture configuration information and starting from a section index page of the target website, captures one by one the links to the articles that appear on the section index page and follows each article link to crawl the article's pagination information and body content; a judgment module 300, which determines whether the article has comment data; and a comment capture module 400, which, when the article has comment data, further follows the link to its comment page to crawl the comment pagination information and comment content.
In one embodiment of the system, the system further comprises a logging module, which records log information when an exception occurs during capture; the parallel distributed internet data capture system then retries the capture until it succeeds.
In one embodiment of the system, the data capture modules have a parallel distributed structure.
To help those skilled in the art better understand the technical solution of the invention, it is further described below with reference to Figs. 3 to 6.
The invention mainly addresses three problems: how to capture massive internet data efficiently and accurately, how to deduplicate captured data flexibly and capture incrementally, and how to unify the capture of dynamic and static web pages. This part first introduces the overall implementation of the system and then, on that basis, describes the implementation of each part together with the relevant procedures.
1. Overall implementation of the internet data capture system
The overall design of the parallel distributed internet data crawler can be summarized as follows. The input is the capture configuration information that has been manually prepared in advance for a section of the target website. Once started, the data capture system begins at the section index page of the website, captures one by one all the article links that appear on it, and follows each article link to capture the article's pagination information and body content. If an article has comment data, the system further captures the comment content on its comment pages (including paginated comments).
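The depth-first flow just described (index page, then each article with its pagination, then its comments with their pagination) can be sketched as below. The `Page` structure, its field names, and the injected `fetch` callable are hypothetical stand-ins for whatever parsing and downloading the real system performs; the patent does not specify them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Page:
    """Hypothetical parsed page; the field names are assumptions of this sketch."""
    content: str = ""
    article_links: List[str] = field(default_factory=list)  # filled on index pages
    next_page: Optional[str] = None       # next page of the same article or comment thread
    comments_link: Optional[str] = None   # first comment page, if the article has comments


def crawl_paginated(first_url: str, fetch: Callable[[str], Page]) -> List[Page]:
    """Follow next_page links starting at first_url and return every page visited."""
    pages, url, seen = [], first_url, set()
    while url is not None and url not in seen:
        seen.add(url)  # guard against pagination loops
        page = fetch(url)
        pages.append(page)
        url = page.next_page
    return pages


def crawl_section(index_url: str, fetch: Callable[[str], Page]) -> Dict[str, dict]:
    """Index page -> each article (all pages) -> its comments (all pages)."""
    results = {}
    for link in fetch(index_url).article_links:
        article_pages = crawl_paginated(link, fetch)
        comments_link = article_pages[0].comments_link
        comment_pages = crawl_paginated(comments_link, fetch) if comments_link else []
        results[link] = {
            "body": [p.content for p in article_pages],
            "comments": [p.content for p in comment_pages],
        }
    return results
```

Passing `fetch` in as a parameter keeps the traversal logic independent of how pages are downloaded and parsed, so it can be exercised against an in-memory dictionary of pages.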
The operational flow of the data capture system is shown in Fig. 3, and the exception-handling flow is shown in Fig. 4.
2. Specific implementation of the internet data capture system
This part describes in turn how each of the three problems is addressed: how to capture massive internet data efficiently and accurately, how to deduplicate captured data flexibly and capture incrementally, and how to unify the capture of dynamic and static web pages.
1) Efficient and accurate capture of massive internet data
To capture massive internet data efficiently and accurately, the invention approaches the problem from two directions. The first is parallelism: multiple crawler instances are started on the same crawler server and crawl network data in parallel. The second is distribution: crawlers are deployed on multiple servers at once, and the crawler on each server can work independently. As shown in the schematic diagram in Fig. 5, the whole data capture system is organized around a central database. Multiple crawler servers are deployed around this central database, and each crawler server runs multiple crawl threads simultaneously. This program structure ensures that massive internet data can be captured efficiently and in real time. It also introduces the problems of duplicate captured data and of how to capture incrementally, which the next part addresses.
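A rough sketch of the two levels of concurrency follows, using only the Python standard library. The thread pool corresponds to the multiple crawl threads on one server. How sections are divided among servers is not stated in the text, so the hash-based assignment in `sections_for_server` is purely an assumption of this example, as are both function names.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def sections_for_server(sections: List[str], server_index: int, server_count: int) -> List[str]:
    """Assumed partitioning: give each section URL to exactly one server via a stable hash."""
    return [
        s for s in sections
        if zlib.crc32(s.encode("utf-8")) % server_count == server_index
    ]


def crawl_in_parallel(sections: List[str], crawl_one: Callable[[str], object],
                      workers: int = 4) -> Dict[str, object]:
    """Run crawl_one over the sections with several threads on a single server."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input sections.
        results = list(pool.map(crawl_one, sections))
    return dict(zip(sections, results))
```

Threads suit this workload because crawling is dominated by network waits; each server would call `sections_for_server` with its own index and then hand the result to `crawl_in_parallel`.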
2) Flexible deduplication and incremental capture of captured data
One important measure of whether an internet data capture system is well designed is whether it supports data deduplication and incremental capture. On the one hand, because crawlers run in cycles, they inevitably encounter web page information they have already captured. On the other hand, internet information changes in real time, so between two captures of the same website section at different times, the information under the section may have changed (comments in particular tend to accumulate). Since earlier captures have already saved a large amount of data, the next capture should not fetch the saved data again. Repeated capture produces redundant data, which can make later analysis results inaccurate, and it also seriously wastes system resources, network bandwidth, and other resources. To implement deduplication and incremental capture, the system adopts a simple approach: it checks whether the corresponding web page file already exists under the directory where captured web page information is saved, and from this determines whether the current capture would be a repeat. Specifically, when captured data is saved, the directory structure encodes information about the web page: each directory level represents, respectively, the website, section, subsection, title, and so on that the page comes from. This directory structure guarantees that the same web page always maps to the same save directory. Because a directory cannot hold two files with the same name, this achieves deduplication. The design has two advantages. First, it requires no frequent interaction with an external data store; the required information is obtained simply by probing the result files, which is more efficient. Second, it ensures that the crawler module has only one entry point and one exit point, which reduces coupling with other modules.
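The directory-based duplicate check can be illustrated with a short sketch. The site/section/subsection/title hierarchy comes from the description above; the function names, the `.html` suffix, and the character sanitisation rule are assumptions made for the example.

```python
import re
from pathlib import Path


def _safe(name: str) -> str:
    """Replace characters that are awkward in file names (an assumed rule)."""
    return re.sub(r'[\\/:*?"<>|\s]+', "_", name).strip("_") or "untitled"


def page_path(root: Path, site: str, section: str, subsection: str, title: str) -> Path:
    """Map a page to one deterministic path: root/site/section/subsection/title.html."""
    return root / _safe(site) / _safe(section) / _safe(subsection) / (_safe(title) + ".html")


def already_saved(root: Path, site: str, section: str, subsection: str, title: str) -> bool:
    """Duplicate check by probing the file system only; no database query is involved."""
    return page_path(root, site, section, subsection, title).exists()


def save_if_new(root: Path, site: str, section: str, subsection: str,
                title: str, html: str) -> bool:
    """Save the page unless a file for it already exists. Returns True if it was saved."""
    path = page_path(root, site, section, subsection, title)
    path.parent.mkdir(parents=True, exist_ok=True)
    try:
        # Mode "x" creates the file exclusively and fails if it is already there,
        # so two crawl threads cannot both save the same page.
        with open(path, "x", encoding="utf-8") as handle:
            handle.write(html)
    except FileExistsError:
        return False
    return True
```

A crawler would typically call `already_saved` before downloading, to skip the request entirely, and rely on `save_if_new` as the final guard. One limitation of this sketch is that two different titles which sanitise to the same name would be treated as duplicates.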
3) Unified capture of dynamic and static web pages
Page capture can be divided into static page capture and dynamic page capture. Capturing a static web page is very simple: the static address is requested over HTTP and the response is saved, which yields all the body text and comment content shown on the page. A dynamic page cannot be handled this simply. Because it is generated dynamically, its page source does not contain the body text or comment content shown on the page, only some JavaScript commands associated with the page requests sent to the server. By using the Fiddler tool to analyze the HTTP responses and activity that occur when the page is opened, one finds that the dynamic content can also be accessed in a certain form. The dynamic content has merely been converted into another form and stored under a hidden address, and it can be obtained once that hidden address is accessed. Obtaining static and dynamic web pages therefore reduces to finding the address that actually contains the required content, requesting that address over HTTP, and saving the response. A schematic diagram of the unified capture process for static and dynamic web pages is shown in Fig. 6. The key to unified capture of dynamic and static web pages is finding the target URI. Once the target URI is found, the processing flow is the same: the DownloadPage function sends an HTTP request to the target URI and receives the corresponding HTTP response, which contains the content of the page. The same process applies to capturing both article pages and comment pages. The difference between capturing dynamic and static web pages can be seen in Fig. 6: the target URI of a static web page is easy to find (in other words, it is easily assembled from a URI pattern by the program and the configuration), whereas the target URI of a dynamic web page is usually hidden and can only be obtained through analysis with an external tool. Once the external tool analysis has produced the target URI of the dynamic web page, or its pattern and rules, the capture of dynamic and static web pages is unified into a single flow. The parts of Fig. 6 where dynamic and static capture differ are usually handled in the capture configuration phase, so the implementation of the capture program itself is in effect the processing that follows once the target URI or target URI rule is known.
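As an illustration of the unified flow, the sketch below builds a target URI from a configured pattern and then downloads it the same way for both page types. The text names a DownloadPage function but gives no signature, so `download_page`, `build_target_uri`, the `opener` parameter, the User-Agent string, and the example patterns in the usage note are all hypothetical.

```python
from urllib.request import Request, urlopen


def build_target_uri(pattern: str, **params) -> str:
    """Fill a configured URI pattern, e.g. with a section name or an article id."""
    return pattern.format(**params)


def download_page(target_uri: str, timeout: float = 10.0, opener=urlopen) -> str:
    """Send one HTTP request to the target URI and return the decoded response body.

    The same call serves static pages and the hidden endpoints behind dynamic
    pages; only the target URI differs. opener is injectable for testing.
    """
    request = Request(target_uri, headers={"User-Agent": "example-crawler/0.1"})
    with opener(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Under these assumptions, a static page might be configured with a pattern such as `http://news.example/{section}/{article_id}.html`, while a dynamic comment list found through traffic analysis might use something like `http://comments.example/api/list?article={article_id}&page={page}`. Both would pass through the same `download_page` call.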
In summary, the invention first provides a parallel distributed internet data capture method and system in which the target sites to be captured can be extended freely through configuration, and whose parallel and distributed design safeguards the efficiency and timeliness of data capture. Second, the invention adopts a flexible deduplication and incremental capture mechanism: without any secondary database query, simply by checking the files under the local page storage path, it deduplicates captured web pages and captures incrementally, which guarantees the uniqueness of the captured data and saves a large amount of software and hardware resources. Third, the invention supports unified capture of both dynamic and static web pages. The method and system of the invention therefore have the advantages of high quality and high efficiency.
It should be noted that any process or method described in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing steps of a specific logical function or process. The scope of the preferred embodiments of the invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the invention belong.
In this specification, a description referring to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. Schematic uses of these terms in this specification do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the invention have been shown and described above, it is to be understood that these embodiments are exemplary and are not to be construed as limiting the invention. A person of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the invention without departing from its principle and purpose.
Claims (6)
1. A parallel distributed internet data capture method, characterized in that it comprises the steps of:
setting capture configuration information for a target website;
according to the capture configuration information, starting from a section index page of the target website, capturing one by one the links to the articles that appear on the section index page, and following each article link to crawl the article's pagination information and body content; and
determining whether the article has comment data and, if so, further following the link to its comment page to crawl the comment pagination information and comment content.
2. The parallel distributed internet data capture method according to claim 1, characterized in that it further comprises: when an exception occurs during capture, recording log information and retrying the capture until it succeeds.
3. The parallel distributed internet data capture method according to claims 1 and 2, characterized in that the data capture is carried out in a parallel and distributed manner.
4. A parallel distributed internet data capture system, characterized in that it comprises:
a configuration module, with which a user sets capture configuration information for a target website;
an article capture module, which, according to the capture configuration information and starting from a section index page of the target website, captures one by one the links to the articles that appear on the section index page and follows each article link to crawl the article's pagination information and body content;
a judgment module, which determines whether the article has comment data; and
a comment capture module, which, when the article has comment data, further follows the link to its comment page to crawl the comment pagination information and comment content.
5. The parallel distributed internet data capture system according to claim 4, characterized in that it further comprises: a logging module, which records log information when an exception occurs during capture, whereupon the parallel distributed internet data capture system retries the capture until it succeeds.
6. The parallel distributed internet data capture system according to claims 4 and 5, characterized in that the data capture modules have a parallel distributed structure.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210422571.5A CN102937989B (en) | 2012-10-29 | 2012-10-29 | Parallelization distributed interconnection data grab method and system thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210422571.5A CN102937989B (en) | 2012-10-29 | 2012-10-29 | Parallelization distributed interconnection data grab method and system thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102937989A (en) | 2013-02-20 |
CN102937989B (en) | 2016-06-22 |
Family
ID=47696886
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210422571.5A Expired - Fee Related CN102937989B (en) | 2012-10-29 | 2012-10-29 | Parallelization distributed interconnection data grab method and system thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102937989B (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103258017A (en) * | 2013-04-24 | 2013-08-21 | 中国科学院计算技术研究所 | Method and system for parallel square crossing network data collection |
CN104391917A (en) * | 2014-11-19 | 2015-03-04 | 四川长虹电器股份有限公司 | Method for incrementally capturing webpage contents |
CN104965888A (en) * | 2015-06-16 | 2015-10-07 | 武汉华工赛百数据系统有限公司 | Data acquiring method and system |
CN105447184A (en) * | 2015-12-15 | 2016-03-30 | 北京百分点信息科技有限公司 | Information capturing method and device |
CN106126747A (en) * | 2016-07-14 | 2016-11-16 | 北京邮电大学 | Data capture method based on reptile and device |
CN107506425A (en) * | 2017-08-18 | 2017-12-22 | 广东电网有限责任公司信息中心 | A kind of web page files gather archiving method |
CN107590236A (en) * | 2017-09-09 | 2018-01-16 | 杭州数立方征信有限公司 | A kind of big data acquisition method and system towards enterprise in charge of construction |
CN108121751A (en) * | 2016-11-30 | 2018-06-05 | 北京国双科技有限公司 | The method and apparatus of web page crawl |
CN108932299A (en) * | 2018-06-07 | 2018-12-04 | 北京迈格威科技有限公司 | The method and device being updated for the model to inline system |
CN109213824A (en) * | 2017-06-29 | 2019-01-15 | 北京京东尚科信息技术有限公司 | Data grabber system, method and apparatus |
CN111651656A (en) * | 2020-06-02 | 2020-09-11 | 重庆邮电大学 | Method and system for dynamic webpage crawler based on agent mode |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070130125A1 (en) * | 2005-12-05 | 2007-06-07 | Bmenu As | System, process and software arrangement for assisting in navigating the internet |
CN101291304A (en) * | 2008-06-13 | 2008-10-22 | 清华大学 | Transplantable network information sharing method |
CN101404666A (en) * | 2008-10-06 | 2009-04-08 | 赵洪宇 | Infinite layer collection method based on Web page |
CN102609412A (en) * | 2011-01-07 | 2012-07-25 | 华东师范大学 | RSS (Really Simple Syndication)-based multi-thread graphic information synchronization crawling control method and system |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103258017B (en) * | 2013-04-24 | 2016-04-13 | 中国科学院计算技术研究所 | Parallel square crossing network data acquisition method and system |
CN103258017A (en) * | 2013-04-24 | 2013-08-21 | 中国科学院计算技术研究所 | Method and system for parallel square crossing network data collection |
CN104391917A (en) * | 2014-11-19 | 2015-03-04 | 四川长虹电器股份有限公司 | Method for incrementally capturing webpage contents |
CN104965888A (en) * | 2015-06-16 | 2015-10-07 | 武汉华工赛百数据系统有限公司 | Data acquiring method and system |
CN105447184A (en) * | 2015-12-15 | 2016-03-30 | 北京百分点信息科技有限公司 | Information capturing method and device |
CN105447184B (en) * | 2015-12-15 | 2019-06-11 | 北京百分点信息科技有限公司 | Information extraction method and device |
CN106126747A (en) * | 2016-07-14 | 2016-11-16 | 北京邮电大学 | Crawler-based data capture method and device |
CN108121751A (en) * | 2016-11-30 | 2018-06-05 | 北京国双科技有限公司 | Method and apparatus for web page crawling |
CN109213824A (en) * | 2017-06-29 | 2019-01-15 | 北京京东尚科信息技术有限公司 | Data capture system, method and apparatus |
CN107506425A (en) * | 2017-08-18 | 2017-12-22 | 广东电网有限责任公司信息中心 | Web page file collection and archiving method |
CN107590236A (en) * | 2017-09-09 | 2018-01-16 | 杭州数立方征信有限公司 | Big data acquisition method and system for building construction enterprises |
CN107590236B (en) * | 2017-09-09 | 2020-08-28 | 数立方(杭州)信息科技有限公司 | Big data acquisition method and system for building construction enterprises |
CN108932299A (en) * | 2018-06-07 | 2018-12-04 | 北京迈格威科技有限公司 | Method and device for updating a model of an online system |
CN111651656A (en) * | 2020-06-02 | 2020-09-11 | 重庆邮电大学 | Method and system for dynamic webpage crawler based on agent mode |
Also Published As
Publication number | Publication date |
---|---|
CN102937989B (en) | 2016-06-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102937989A (en) | Parallel distributed internet data capture method and system | |
US9767082B2 (en) | Method and system of retrieving ajax web page content | |
CN109299073B (en) | Data lineage generation method and system, electronic equipment and storage medium | |
Do et al. | Three-dimensional numerical simulation of mechanized twin stacked tunnels in soft ground | |
CN106250454A (en) | Page script loading method and device | |
CN103778236A (en) | Webpage data distribution processing method and device as well as webpage generation processing method and device | |
CN102262635A (en) | Page crawler system and page crawler method | |
CN109063144A (en) | Visual network crawler method and device | |
CN104750866B (en) | Method and system for dynamically generating a menu bar based on an embedded web server | |
CN103701906A (en) | Distributed real-time calculation system and data processing method thereof | |
CN111708962A (en) | Rendering method, device and equipment of skeleton screen and storage medium | |
CN102004805B (en) | Webpage denoising system and method based on maximum similarity matching | |
CN105574027A (en) | On-line transaction processing/on-line analytical processing (OLTP/OLAP) hybrid application based multi-dimensional performance data storage method, device and system | |
CN112860844A (en) | Case clue processing system, method and device and computer equipment | |
CN201594261U (en) | Site information resolution system | |
Xu et al. | Simultaneous shape and topology optimization of truss under local and global stability constraints | |
US9021349B1 (en) | System, method, and computer program product for identifying differences in a EDA design | |
CN105739997A (en) | Website construction assistance design system | |
Yuk et al. | Comparison of extraction methods for bug tracking system analysis | |
Regenauer-Lieb et al. | Understanding earthquake precursors: from subcritical instabilities to catastrophic events | |
Siddiqui | An investigation into data sharing between building information modelling and fire safety engineering, with potential applications to smart buildings | |
CN106528711B (en) | Intersection solving method and system for data of out-of-table files | |
Slone et al. | A finite volume unstructured mesh approach to dynamic fluid-structure interaction: an assessment of the challenge of flutter analysis | |
Schoenstein | Computer Science on Campus: Technology,(Inter) disciplinarity, and the Transformation of the American University | |
Khorasani et al. | Comparison between the behavior of RCS and Concrete Moment Frames under Near Fault Earthquakes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20160622; Termination date: 20171029 |