CN102937989B - Parallelized distributed Internet data crawling method and system - Google Patents

Parallelized distributed Internet data crawling method and system

Info

Publication number: CN102937989B
Authority: CN (China)
Prior art keywords: crawl, page, configuration information, text, comment
Legal status: Expired - Fee Related (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201210422571.5A
Other languages: Chinese (zh)
Other versions: CN102937989A (en)
Inventor: 杨睿尘
Current Assignee: Beijing Tengyi Science & Technology Development Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Beijing Tengyi Science & Technology Development Co Ltd
Application filed by: Beijing Tengyi Science & Technology Development Co Ltd
Priority to: CN201210422571.5A (patent CN102937989B)
Publications: CN102937989A (application), CN102937989B (grant)

Abstract

The present invention proposes a parallelized distributed Internet data crawling method and system. The method comprises the steps of: setting crawl configuration information for a target website; according to the crawl configuration information, starting from a section index page of the target website, capturing one by one the links to the articles that appear on the section index page, and following each article link to crawl the article's pagination information and body content; and determining whether the article includes comment data and, if so, further following the link to the comment page to crawl the comment pagination information and comment content. The parallelized distributed Internet data crawling method and system of the present invention have the advantages of high quality and high efficiency.

Description

Parallelized distributed Internet data crawling method and system
Technical field
The present invention relates to the fields of computer application technology and information technology, and in particular to a parallelized distributed Internet data crawling method and system.
Background
Nowadays the Internet is developing rapidly, and the number of Internet users in China is growing explosively. The Internet is gradually replacing traditional media (including newspapers, books, radio and television) and becoming the main channel through which people obtain and publish information. At the same time, because the Internet is free and open, simple to use, fast in dissemination and has a very large user base, information on the Internet can spread quickly and exert influence. Because the Internet plays an increasingly important role, research on Internet information is also flourishing. To carry out such research, it is first necessary to crawl and process massive amounts of Internet web page information in different formats and to convert it into a unified format to facilitate later analysis; second, a high-quality and high-efficiency crawling technology must be applied. Based on this urgent need, we have developed a parallelized distributed Internet data crawling system.
Summary of the invention
The present invention aims to solve at least one of the above technical problems at least to some extent, or at least to provide a useful commercial choice. To this end, an object of the present invention is to propose a parallelized distributed Internet data crawling method and system with high quality and high efficiency.
One aspect of the present invention proposes a parallelized distributed Internet data crawling method, comprising: setting crawl configuration information for a target website; according to the crawl configuration information, starting from a section index page of the target website, capturing one by one the links to the articles appearing on the section index page, and following each article link to crawl the article's pagination information and body content; and determining whether the article includes comment data and, if so, further following the link to the comment page to crawl the comment pagination information and comment content.
In one embodiment of the method of the present invention, the method further comprises: when an exception occurs during crawling, recording log information and retrying the crawl until it succeeds.
In one embodiment of the method of the present invention, the data crawling is carried out in a parallelized and distributed manner.
Another aspect of the present invention proposes a parallelized distributed Internet data crawling system, comprising: a configuration module, with which the user sets the crawl configuration information for a target website; an article crawling module which, according to the crawl configuration information, starts from a section index page of the target website, captures one by one the links to the articles appearing on the section index page, and follows each article link to crawl the article's pagination information and body content; a determination module for determining whether the article includes comment data; and a comment crawling module which, when the article includes comment data, further follows the link to the comment page to crawl the comment pagination information and comment content.
In one embodiment of the system of the present invention, the system further comprises a logging module for recording log information when an exception occurs during crawling, whereupon the parallelized distributed Internet data crawling system retries the crawl until it succeeds.
In one embodiment of the system of the present invention, the data crawling modules have a parallelized distributed structure.
In summary, first, the present invention proposes a parallelized distributed Internet data crawling method and system in which the target sites to be crawled can be freely extended through configuration, and which adopts a parallelized and distributed design so that crawling efficiency and timeliness are guaranteed. Second, the present invention adopts a flexible duplicate-detection and incremental crawling mechanism: without any secondary query to the database, merely by checking the files under the local page storage path, duplicate detection and incremental crawling of web pages can be achieved, which ensures the uniqueness of the crawled data and saves a large amount of software and hardware resources. Furthermore, the present invention supports unified crawling of both dynamic and static web pages. Therefore, the method and system of the present invention have the advantages of high quality and high efficiency.
Additional aspects and advantages of the present invention will be given in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of the parallelized distributed Internet data crawling method according to an embodiment of the present invention;
Fig. 2 is a structural block diagram of the parallelized distributed Internet data crawling system according to an embodiment of the present invention;
Fig. 3 is a detailed flowchart of the parallelized distributed Internet data crawling method under normal conditions according to an embodiment of the present invention;
Fig. 4 is a detailed flowchart of the parallelized distributed Internet data crawling method under abnormal conditions according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the layout structure of the data crawling modules with parallelized distributed characteristics according to an embodiment of the present invention; and
Fig. 6 is a schematic diagram of unified crawling of dynamic and static web pages according to an embodiment of the present invention.
Detailed description
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and are not to be construed as limiting it.
In the description of the present invention, it should be understood that orientation or positional relationships indicated by terms such as "center", "longitudinal", "transverse", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise" and "counterclockwise" are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description of the present invention, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they are therefore not to be construed as limiting the present invention.
In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more such features. In the description of the present invention, "a plurality of" means two or more, unless otherwise specifically defined.
In the present invention, unless otherwise expressly specified and limited, terms such as "mounted", "connected", "coupled" and "fixed" should be understood broadly. For example, a connection may be a fixed connection, a detachable connection or an integral connection; it may be a mechanical connection or an electrical connection; it may be a direct connection or an indirect connection through an intermediary; and it may be internal communication between two elements. Those of ordinary skill in the art can understand the specific meanings of the above terms in the present invention according to the specific circumstances.
In the present invention, unless otherwise expressly specified and limited, a first feature being "on" or "under" a second feature may include the first and second features being in direct contact, and may also include the first and second features not being in direct contact but being in contact through another feature between them. Moreover, a first feature being "on", "above" or "over" a second feature includes the first feature being directly above or obliquely above the second feature, or merely indicates that the first feature is at a higher level than the second feature. A first feature being "under", "below" or "beneath" a second feature includes the first feature being directly below or obliquely below the second feature, or merely indicates that the first feature is at a lower level than the second feature.
The present invention belongs to the fields of computer application technology and information technology, and mainly concerns the implementation of a focused web crawler based on depth-first crawling. The web crawler is the foundation and prerequisite for Internet information analysis: all analysis operations are carried out on the massive Internet data captured by the crawler.
The main purpose of the present invention is to solve three problems: efficient and accurate crawling of massive Internet data; flexible duplicate detection and incremental crawling of the crawled data; and unified crawling of dynamic and static web pages. A prominent requirement of Internet data analysis is high timeliness of the data, and the volume of Internet data is enormous; to ensure comprehensive coverage of Internet data, a technology that can crawl massive Internet data efficiently and accurately is therefore needed. For this purpose we have developed a parallelized distributed Internet data crawling system. To support its parallelized and distributed character, a flexible and efficient duplicate-detection and incremental crawling mechanism is implemented, which ensures the uniqueness of the crawled data and a high utilization of software and hardware resources. Web pages on the Internet today fall into two classes, dynamic and static. Our crawling system must therefore support crawling of both; for dynamic and static web pages we adopt an identical crawling method and flow, which reduces the complexity of the program and the difficulty of later maintenance.
One aspect of the present invention proposes a parallelized distributed Internet data crawling method which, as shown in Fig. 1, comprises: S1. setting crawl configuration information for a target website; S2. according to the crawl configuration information, starting from a section index page of the target website, capturing one by one the links to the articles appearing on the section index page, and following each article link to crawl the article's pagination information and body content; and S3. determining whether the article includes comment data and, if so, further following the link to the comment page to crawl the comment pagination information and comment content.
In one embodiment of the method of the present invention, the method further comprises: when an exception occurs during crawling, recording log information and retrying the crawl until it succeeds.
In one embodiment of the method of the present invention, the data crawling is carried out in a parallelized and distributed manner.
Another aspect of the present invention proposes a parallelized distributed Internet data crawling system, comprising: a configuration module 100, with which the user sets the crawl configuration information for a target website; an article crawling module 200 which, according to the crawl configuration information, starts from a section index page of the target website, captures one by one the links to the articles appearing on the section index page, and follows each article link to crawl the article's pagination information and body content; a determination module 300 for determining whether the article includes comment data; and a comment crawling module 400 which, when the article includes comment data, further follows the link to the comment page to crawl the comment pagination information and comment content.
In one embodiment of the system of the present invention, the system further comprises a logging module for recording log information when an exception occurs during crawling, whereupon the parallelized distributed Internet data crawling system retries the crawl until it succeeds.
In one embodiment of the system of the present invention, the data crawling modules have a parallelized distributed structure.
To enable those skilled in the art to better understand the technical solution, it is further described below with reference to Figs. 3 to 6.
The present invention mainly concerns how to crawl massive Internet data efficiently and accurately, how to perform flexible duplicate detection and incremental crawling on the crawled data, and how to achieve unified crawling of dynamic and static web pages. This part first introduces the overall implementation of the system; then, on that basis, it introduces the details of each part and the related program implementation.
1. Overall implementation of the Internet data crawling system
The overall design of parallelized distributed Internet data crawling can be summarized as follows. The input is the crawl configuration information that has been manually configured in advance for the sections of the target website. Once the data crawling system is started, it goes through the section index page of the website, captures one by one all the article links appearing on it, and follows each article link to crawl the article's pagination information and body content. Meanwhile, if an article includes comment data, the system further crawls the comment content on the comment page (including paginated comments).
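The flow just described can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: the configuration keys (`index_url`, `article_link_re`, `next_page_re`, `comment_link_re`), the function names and the use of regular expressions to locate links are all hypothetical, and the `fetch` callable that returns a page's HTML is injected so the flow can be exercised without a network. Relative URLs are not resolved here.

```python
import re
from typing import Callable, Dict, List


def fetch_all_pages(first_url: str, next_page_re: str,
                    fetch: Callable[[str], str]) -> List[str]:
    """Follow pagination links from first_url and return every page's HTML."""
    pages, seen, url = [], set(), first_url
    while url and url not in seen:  # `seen` guards against pagination loops
        seen.add(url)
        html = fetch(url)
        pages.append(html)
        match = re.search(next_page_re, html)
        url = match.group(1) if match else None
    return pages


def crawl_section(config: Dict[str, str],
                  fetch: Callable[[str], str]) -> List[dict]:
    """Crawl one section index page: articles first, then comments if present."""
    index_html = fetch(config["index_url"])
    records = []
    # Assumption: each configured regex captures the link target in group 1.
    for article_url in re.findall(config["article_link_re"], index_html):
        body_pages = fetch_all_pages(article_url, config["next_page_re"], fetch)
        record = {"url": article_url, "body_pages": body_pages,
                  "comment_pages": []}
        comment_link = re.search(config["comment_link_re"], body_pages[0])
        if comment_link:  # the article has comment data, so go one level deeper
            record["comment_pages"] = fetch_all_pages(
                comment_link.group(1), config["next_page_re"], fetch)
        records.append(record)
    return records
```

Passing `fetch` as a parameter keeps the traversal logic separate from the download step, so the same function works whether the pages come from a live site or from saved files.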
The operating flow of the data crawling system is shown in Fig. 3, and the exception handling flow is shown in Fig. 4.
2. Specific implementation of the Internet data crawling system
This part introduces the specific implementation of each of the three aspects in turn: how to crawl massive Internet data efficiently and accurately, how to perform flexible duplicate detection and incremental crawling on the crawled data, and how to achieve unified crawling of dynamic and static web pages.
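The exception handling described in the embodiments (record log information, then retry until the crawl succeeds) could look roughly like the sketch below. The function name `fetch_with_retry`, the logger name and the `delay_seconds` and `max_attempts` parameters are assumptions made for illustration; the source only states that the crawl is retried until it succeeds, which corresponds to leaving `max_attempts` as `None`.

```python
import logging
import time

logger = logging.getLogger("crawler")  # hypothetical logger name


def fetch_with_retry(fetch, url, delay_seconds=1.0, max_attempts=None):
    """Call fetch(url); on any exception, log it and retry.

    With max_attempts=None the loop retries until the crawl succeeds,
    as in the described embodiment. A finite max_attempts is an optional
    safety limit added for this sketch.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return fetch(url)
        except Exception as exc:
            # Record the failure before retrying.
            logger.warning("crawl failed (attempt %d) for %s: %s",
                           attempt, url, exc)
            if max_attempts is not None and attempt >= max_attempts:
                raise
            time.sleep(delay_seconds)
```

A pause between attempts is a design choice of this sketch; it avoids hammering a site that is temporarily unavailable.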
1) Efficient and accurate crawling of massive Internet data
To crawl massive Internet data efficiently and accurately, the present invention approaches the problem from two directions. The first is parallelization: multiple crawl instances are started on the same crawler server and crawl network data in parallel. The second is distribution: crawlers are deployed on multiple servers at the same time, and the crawlers on each server can work independently and simultaneously. A schematic diagram of the parallelized distributed Internet data crawling system is shown in Fig. 5: the whole data crawling system is organized around a central database. Multiple crawler servers are deployed around this central database, and each crawler server runs multiple crawl threads simultaneously. This program structure and implementation ensure efficient, real-time crawling of massive Internet data. However, it also introduces the problems of duplicate crawled data and of how to crawl incrementally, which are exactly the problems the next part solves.
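As a rough illustration of the two levels, the sketch below runs several crawl threads on one server with Python's `concurrent.futures.ThreadPoolExecutor` and splits the sections between servers. The source does not say how work is divided among servers; the hash-based split in `sections_for_server` is purely an assumption of this sketch, as are all function and parameter names.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def sections_for_server(section_urls, server_index, num_servers):
    """Pick the sections this server is responsible for.

    Assumption: sections are split by a stable hash of the URL, so every
    server can compute its own share without coordination.
    """
    return [u for u in section_urls
            if zlib.crc32(u.encode("utf-8")) % num_servers == server_index]


def crawl_in_parallel(section_urls, crawl_one, num_threads=4):
    """Run crawl_one over every section using a pool of crawl threads."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Executor.map returns results in the order of the inputs.
        return list(pool.map(crawl_one, section_urls))
```

Each server would call `sections_for_server` with its own index and pass the result to `crawl_in_parallel`.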
2) Flexible duplicate detection and incremental crawling of crawled data
An important criterion for whether the design of an Internet data crawling system is successful is whether it supports duplicate detection and incremental crawling. On the one hand, because crawlers run in cycles, they will inevitably capture repeated web page information. On the other hand, Internet information changes in real time: for two different crawls of the same website section, because the crawl times differ, the information under the section may have changed by the later crawl (in particular, comments tend to grow over time). Since earlier crawls have already saved a large amount of data, the next crawl should not capture the saved data again. Repeated crawling causes data redundancy, which makes later analysis results inaccurate, and it seriously wastes resources such as system resources and network bandwidth. To implement duplicate detection and incremental crawling, the system adopts a simple approach: it checks whether the relevant web page file already exists under the directory where crawled web page information is saved, and thereby determines whether the page about to be crawled is a duplicate. Specifically, when crawled data is saved, the directory structure encodes information about the web page: each directory level represents, respectively, the website the page comes from, the section, the sub-section, the title, and so on. This directory structure guarantees that the same web page always maps to the same save location. Because a directory cannot hold two files with the same name, duplicate detection follows. The advantages of this design are as follows. First, no frequent interaction with an external data source is needed; the required information is obtained simply by probing for the result file, so efficiency is higher. Second, it ensures that the crawler module has only one entry and one exit, which reduces coupling with other modules.
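The directory-based duplicate check can be sketched as follows. The layout root/site/section/sub-section/title.html, the sanitizing rule for path components and the names `page_path` and `save_if_new` are assumptions for illustration; the source only states that the directory levels encode site, section, sub-section and title, and that an existing file marks a duplicate.

```python
import os
import re


def page_path(root, site, section, subsection, title):
    """Map a page to its save path: root/site/section/subsection/title.html."""
    parts = [re.sub(r'[\\/:*?"<>|\s]+', "_", p).strip("_") or "_"
             for p in (site, section, subsection, title)]
    return os.path.join(root, *parts) + ".html"


def save_if_new(root, site, section, subsection, title, html):
    """Save the page unless its file already exists.

    Returns True if the page was new and has been saved, False if it is a
    duplicate. No database query is needed, only a filesystem probe.
    """
    path = page_path(root, site, section, subsection, title)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    try:
        # Mode "x" creates the file exclusively and fails if it exists,
        # so two crawl threads cannot both save the same page.
        with open(path, "x", encoding="utf-8") as f:
            f.write(html)
    except FileExistsError:
        return False
    return True
```

In practice the crawler would probe the path before downloading, so a duplicate costs neither bandwidth nor a database round trip; newly added comment pages get new file names and are therefore picked up incrementally.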
3) Unified implementation of crawling for dynamic and static web pages
Page crawling can be divided into static page crawling and dynamic page crawling. Crawling a static web page is very simple: it is only necessary to send an http request to the static URL and save the response content obtained, which yields all the article and comment content displayed on the page. A dynamic page cannot be handled so simply, because it is generated dynamically only after the page sends a request to the server; its page source does not contain the article or comment content shown on the page, only some JavaScript commands. By using the Fiddler tool to analyze the various http responses and activity that occur when the web page is opened, it can be determined that the dynamic content can also be obtained in a certain form. The dynamic content is simply converted into another format and stored under a hidden URL; if the hidden URL can be obtained, the dynamic content can be obtained. Obtaining static and dynamic web pages therefore reduces to finding the URL that actually contains the required content, sending an http request to that URL, and saving the response content obtained. A schematic diagram of the unified crawling process for static and dynamic web pages is shown in Fig. 6. The key to unified crawling of dynamic and static web pages is finding the target URI. Once the target URI is found, the handling is uniform: it is only necessary to send an http request to the target URI through the DownloadPage function to obtain the corresponding http response, which contains the content of the web page. This process applies equally to crawling article pages and comment pages. As for why crawling differs between dynamic and static web pages, Fig. 6 shows that the target URI of a static web page is easy to find (or, put differently, easy for the program to assemble from the configured URI pattern), whereas the target URI of a dynamic web page is usually hidden and can only be obtained through analysis with an external tool. After the external tool analysis yields the target URI of the dynamic web page, or the pattern and rules of the target URI, the whole crawling process for dynamic and static web pages is unified into a single flow. The parts of Fig. 6 where dynamic and static crawling differ all occur in the crawl configuration stage, so the implementation of the whole crawling program is in effect the processing that follows once the target URI or target URI rule is known.
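A minimal sketch of this unified path is given below, using Python's standard `urllib.request`. The source names a DownloadPage function but does not show it, so `download_page` here is only a stand-in with an assumed signature; `build_target_uri` and the `{placeholder}` pattern syntax are likewise hypothetical. The point is that static and dynamic pages differ only in where the URI pattern comes from (a visible URL versus a hidden one found with a tool such as Fiddler), not in how the page is downloaded.

```python
import urllib.request


def build_target_uri(pattern, **params):
    """Fill a configured URI pattern, e.g. one discovered by traffic analysis."""
    return pattern.format(**params)


def download_page(target_uri, timeout=10.0):
    """Send one request to the target URI and return the decoded response body.

    The same function serves static pages, dynamic pages, article pages and
    comment pages, because all of them reduce to a known target URI.
    """
    with urllib.request.urlopen(target_uri, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

For example, a configuration entry for a dynamic comment page might store a pattern such as `http://example.com/comments?id={article_id}&page={page}` (a made-up URL), which `build_target_uri` fills in before `download_page` is called.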
In summary, first, the present invention proposes a parallelized distributed Internet data crawling method and system in which the target sites to be crawled can be freely extended through configuration, and which adopts a parallelized and distributed design so that crawling efficiency and timeliness are guaranteed. Second, the present invention adopts a flexible duplicate-detection and incremental crawling mechanism: without any secondary query to the database, merely by checking the files under the local page storage path, duplicate detection and incremental crawling of web pages can be achieved, which ensures the uniqueness of the crawled data and saves a large amount of software and hardware resources. Furthermore, the present invention supports unified crawling of both dynamic and static web pages. Therefore, the method and system of the present invention have the advantages of high quality and high efficiency.
It should be noted that any process or method described in a flowchart or otherwise described herein may be understood as representing a module, segment or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process, and that the scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it is to be understood that the above embodiments are exemplary and are not to be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention without departing from its principles and purpose.

Claims (2)

1. A parallelized distributed Internet data crawling method, characterized by comprising the steps of:
setting crawl configuration information for a target website, wherein the crawl configuration information includes crawl configuration information for static pages and crawl configuration information for dynamic pages, and wherein the crawl configuration information for dynamic pages is obtained by analyzing the http responses and activity of a dynamic web page to obtain the URL hidden in the dynamic web page, sending an http request to the hidden URL, and saving the corresponding response content obtained;
according to the crawl configuration information, starting from a section index page of the target website, capturing one by one the links to the articles appearing on the section index page, and following each article link to crawl the article's pagination information and body content; and
determining whether the article includes comment data and, if so, further following the link to the comment page to crawl the comment pagination information and comment content;
when an exception occurs during crawling, recording log information and retrying the crawl until it succeeds, wherein the data crawling is carried out in a parallelized and distributed manner.
2. A parallelized distributed Internet data crawling system, characterized by comprising:
a configuration module, with which the user sets the crawl configuration information for a target website, wherein the crawl configuration information includes crawl configuration information for static pages and crawl configuration information for dynamic pages, and wherein the crawl configuration information for dynamic pages is obtained by analyzing the http responses and activity of a dynamic web page to obtain the URL hidden in the dynamic web page, sending an http request to the hidden URL, and saving the corresponding response content obtained;
an article crawling module which, according to the crawl configuration information, starts from a section index page of the target website, captures one by one the links to the articles appearing on the section index page, and follows each article link to crawl the article's pagination information and body content;
a determination module for determining whether the article includes comment data; and
a comment crawling module which, when the article includes comment data, further follows the link to the comment page to crawl the comment pagination information and comment content;
a logging module for recording log information when an exception occurs during crawling, whereupon the parallelized distributed Internet data crawling system retries the crawl until it succeeds.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210422571.5A CN102937989B (en) 2012-10-29 2012-10-29 Parallelized distributed Internet data crawling method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210422571.5A CN102937989B (en) 2012-10-29 2012-10-29 Parallelized distributed Internet data crawling method and system

Publications (2)

Publication Number Publication Date
CN102937989A CN102937989A (en) 2013-02-20
CN102937989B CN102937989B (en) 2016-06-22

Family

ID=47696886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210422571.5A Expired - Fee Related CN102937989B (en) 2012-10-29 2012-10-29 Parallelized distributed Internet data crawling method and system

Country Status (1)

Country Link
CN (1) CN102937989B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103258017B (en) * 2013-04-24 2016-04-13 中国科学院计算技术研究所 A kind of parallel square crossing network data acquisition method and system
CN104391917A (en) * 2014-11-19 2015-03-04 四川长虹电器股份有限公司 Method for incrementally capturing webpage contents
CN104965888A (en) * 2015-06-16 2015-10-07 武汉华工赛百数据系统有限公司 Data acquiring method and system
CN105447184B (en) * 2015-12-15 2019-06-11 北京百分点信息科技有限公司 Information extraction method and device
CN106126747A (en) * 2016-07-14 2016-11-16 北京邮电大学 Data capture method based on reptile and device
CN108121751B (en) * 2016-11-30 2021-01-22 北京国双科技有限公司 Webpage crawling method and device
CN109213824B (en) * 2017-06-29 2022-03-04 北京京东尚科信息技术有限公司 Data capture system, method and device
CN107506425A (en) * 2017-08-18 2017-12-22 广东电网有限责任公司信息中心 A kind of web page files gather archiving method
CN107590236B (en) * 2017-09-09 2020-08-28 数立方(杭州)信息科技有限公司 Big data acquisition method and system for building construction enterprises
CN108932299A (en) * 2018-06-07 2018-12-04 北京迈格威科技有限公司 The method and device being updated for the model to inline system
CN111651656B (en) * 2020-06-02 2023-02-24 重庆邮电大学 Method and system for dynamic webpage crawler based on agent mode

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101291304A (en) * 2008-06-13 2008-10-22 清华大学 Transplantable network information sharing method
CN101404666A (en) * 2008-10-06 2009-04-08 赵洪宇 Infinite layer collection method based on Web page
CN102609412A (en) * 2011-01-07 2012-07-25 华东师范大学 RSS (Really Simple Syndication)-based multi-thread graphic information synchronization crawling control method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
NO325961B1 (en) * 2005-12-05 2008-08-25 Holte Bjoern System, process and software arrangement to assist in navigation on the Internet

Also Published As

Publication number Publication date
CN102937989A (en) 2013-02-20

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160622

Termination date: 20171029
