CN106484828B - Distributed internet data rapid acquisition system and acquisition method - Google Patents

Info

Publication number
CN106484828B
Authority
CN
China
Prior art keywords
data
hyperlink
webpage
extraction
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610864062.6A
Other languages
Chinese (zh)
Other versions
CN106484828A (en)
Inventor
张晖
杨春明
李晓伟
李波
赵旭剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Zhongke Rongchuang Technology Co ltd
Original Assignee
Southwest University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest University of Science and Technology
Priority to CN201610864062.6A
Publication of CN106484828A
Application granted
Publication of CN106484828B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a distributed internet data rapid acquisition system comprising five layers, namely a seed website setting node, a hyperlink acquisition layer, a real-time queue, a webpage downloading and analyzing layer and a webpage data storage layer. The seed website setting node is used for setting and storing the parameters and extraction rules of each data source; the hyperlink acquisition layer is used for requesting the hyperlink list webpages of a data source and extracting the hyperlinks of target webpages; the real-time queue is used for storing and retrieving the URL hyperlinks extracted by the hyperlink acquisition layer together with their corresponding extraction rules, and for storing the URL hyperlinks that have already been accessed; the webpage downloading and analyzing layer is used for requesting and parsing the URL hyperlinks in the real-time queue that have not yet been accessed, and for extracting specific data in a formatted manner; the webpage data storage layer is used for storing the target data extracted in formatted form by the webpage downloading and analyzing layer. The invention acquires data through distributed, layered cooperation and can meet the application requirements of systems with large acquisition volumes, many data sources and high real-time requirements.

Description

Distributed internet data rapid acquisition system and acquisition method
Technical Field
The invention belongs to the technical field of internet big data acquisition, and particularly relates to a distributed internet data rapid acquisition system and an acquisition method.
Background
The rapid development of the internet has brought society into an information age in which data is abundant and open, and the era of big data has arrived. Data plays an extremely important role in enterprise operation, government decision-making, social trend analysis and the like, so how to acquire data rapidly and at large scale has become a technical focus; judged against existing technical schemes, however, data acquisition methods still need to be improved. Traditional internet data acquisition relies mainly on web crawlers and takes structured or semi-structured text data as its object. A web crawler is a program or script that automatically crawls internet web pages according to certain rules. The text data is mostly nested within the markup code of the web page. Because the timeliness of acquisition directly determines how valid and current the data is, rapid acquisition of the data becomes the most important factor.
Disclosure of Invention
In view of this, the invention provides a distributed internet data rapid acquisition system and an acquisition method, addressing the problems of large acquisition volume, numerous data sources and poor real-time performance.
In order to solve the technical problem, the invention discloses a distributed internet data rapid acquisition system which comprises five layers, namely a seed website setting node, a hyperlink acquisition layer, a real-time queue, a webpage downloading and analyzing layer and a webpage data storage layer;
the seed website setting node is used for setting and storing the parameters, extraction rules and the like of each data source, and is a single node;
the hyperlink acquisition layer is used for requesting the hyperlink list webpages of a data source and extracting the hyperlinks of target webpages;
the real-time queue is used for storing and retrieving the URL hyperlinks extracted by the hyperlink acquisition layer together with their corresponding extraction rules, and for storing the URL hyperlinks that have already been accessed;
the webpage downloading and analyzing layer is used for requesting and parsing the URL hyperlinks in the real-time queue that have not yet been accessed, and for extracting specific data in a formatted manner;
the webpage data storage layer is used for storing the target data extracted in formatted form by the webpage downloading and analyzing layer.
Compared with the prior art, the invention can obtain the following technical effects:
1) The acquisition system of the invention runs in a waterfall mode, offers high real-time performance and strong scalability, and responds well to system requirements involving many data sources and large acquisition volumes.
2) The invention acquires data through distributed, layered cooperation, can meet system requirements involving many data sources, large acquisition volumes and high real-time performance, and is highly scalable and customizable. Data extraction comprises two schemes, a structured accurate extraction scheme and a general text extraction scheme (applied only to the body text), so the extracted data is more complete.
Of course, a product implementing the invention need not achieve all of the above technical effects at the same time.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and constitute a part of this specification; together with the description, they illustrate embodiments of the invention and serve to explain it, not to limit it. In the drawings:
FIG. 1 is a block diagram of the distributed internet data rapid acquisition system of the present invention;
FIG. 2 is a flow chart of the distributed internet data rapid acquisition method of the present invention.
Detailed Description
The embodiments are described in detail below with reference to the accompanying drawings, so that the way the invention applies its technical features to solve the technical problems and achieve the technical effects can be fully understood and implemented.
The invention discloses a distributed internet data rapid acquisition system which, as shown in FIG. 1, comprises a seed website setting node, a hyperlink acquisition layer, a real-time queue, a webpage downloading and analyzing layer and a webpage data storage layer.
The seed website setting node is used for setting and storing the parameters, extraction rules and the like of each data source, and is a single node; the seed website setting node uses a relational database.
The hyperlink acquisition layer is used for requesting the hyperlink list webpages of a data source and extracting the hyperlinks of target webpages. The layer is composed of one or more web crawler nodes; the crawlers are physically isolated from one another and cooperate logically to complete the hyperlink extraction work for the target webpages, and the layer can be scaled horizontally according to the scale of the data sources. Each node runs on a timer; the node combines the hyperlinks it acquires with their corresponding extraction rules and stores them in the real-time queue.
The real-time queue is used for storing and retrieving the URL hyperlinks extracted by the hyperlink acquisition layer together with their corresponding extraction rules, and for storing the URL hyperlinks that have already been accessed; it is the core component through which the crawler nodes cooperate. The layer has high real-time performance, can store and retrieve data in real time, has persistent storage capability, and also filters the collected URL hyperlinks. The real-time queue is deployed independently; hyperlinks that have not yet been accessed are stored and retrieved in real time, and hyperlinks that have been accessed are stored persistently.
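The queue behaviour described above can be sketched as follows. This is a minimal in-memory illustration, assuming that duplicate filtering happens when a URL is pushed; the class name `RealTimeQueue`, its methods, and the example URL and rule are hypothetical, and a plain `set` stands in for the persistent store of accessed hyperlinks, for which the patent names no product.

```python
from collections import deque


class RealTimeQueue:
    """In-memory stand-in for the real-time queue (hypothetical names)."""

    def __init__(self):
        self._pending = deque()  # entries not yet accessed, first in first out
        self._seen = set()       # stands in for the persistent record of URLs

    def push(self, url, rules):
        """Enqueue a URL with its extraction rules; return False for a repeat."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append((url, rules))
        return True

    def pop(self):
        """Return the next (url, rules) pair, or None when nothing is pending."""
        return self._pending.popleft() if self._pending else None


queue = RealTimeQueue()
queue.push("http://example.com/a.html", {"title_rule": "rule 1"})
queue.push("http://example.com/a.html", {"title_rule": "rule 1"})  # filtered out
print(queue.pop())
print(queue.pop())
```

In a deployment with several crawler nodes, the set and the pending list would have to live in a shared service rather than in process memory; the sketch only shows the filtering logic.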
The webpage downloading and analyzing layer is used for requesting and parsing the URL hyperlinks in the real-time queue that have not yet been accessed, and for extracting specific data in a formatted manner. Like the hyperlink acquisition layer, it is composed of a plurality of web crawler nodes that work independently of one another; the nodes are mainly responsible for the structured extraction of the target data, and the layer can be scaled horizontally according to the number of hyperlinks extracted by the hyperlink acquisition layer. Each node contains an information structuring extraction method based on the HTML document tree and a general text information extraction method, and can switch between the two when extracting the body text of a webpage. Each node runs in real time: it reads hyperlinks from the real-time queue, then filters, parses and formats the target data and stores it.
The webpage data storage layer is used for storing the target data extracted in formatted form by the webpage downloading and analyzing layer. The layer is implemented with an open-source big data storage database using multi-node storage; it is well suited to storing webpage document data, has large storage capacity and excellent read performance, and its number of nodes can be expanded dynamically.
A distributed network data rapid acquisition method, based on the distributed network data rapid acquisition system described above, comprises the following specific steps, as shown in FIG. 2:
Step 1, the seed website setting node sets all seed URLs, extraction rules, website encodings and other information;
the user adds all target websites to be collected through a web system; each target website contains a plurality of target sections of interest to the user, such as education and people's livelihood, and the following settings are then made:
firstly, setting the website information, including the domain name, type and page encoding of the current website, a general target URL filtering regular expression and general information extraction rules (covering author, release time, body text and the like); the information set for the website applies to all collected sections under that website, as shown in Table 1:
Table 1. Website settings
Secondly, setting the section information of the website, including the name and seed URL of each section; if the domain name, target URL filtering regular expression or information extraction rules of a section differ from the general settings of the website, they are set individually in the corresponding fields. For example, the website may set its domain name to http:// news sc. org/ while a section has the domain name http:// edu. news sc. org/. Any information of the current section that is the same as the general settings of the website need not be set again. The sections subordinate to "website 1" in Table 1 are set as shown in Table 2:
Table 2. Section settings for "website 1" in Table 1
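The two levels of settings in step 1 can be pictured as two record types. The following is only a sketch under assumptions: the class and field names, the `example.org` domains, the regular expression and the rule strings are all hypothetical, since the patent does not specify a schema or a rule syntax. A section field left as `None` (or a rule left out) means "same as the website's general setting".

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WebsiteSetting:
    """General settings that apply to every collected section of a website."""
    name: str
    domain: str
    site_type: str
    encoding: str
    url_filter: str                            # general target-URL regex
    rules: dict = field(default_factory=dict)  # general extraction rules


@dataclass
class SectionSetting:
    """Per-section settings; only values that differ from the website are set."""
    name: str
    seed_url: str
    domain: Optional[str] = None
    url_filter: Optional[str] = None
    rules: dict = field(default_factory=dict)  # overriding rules only


site = WebsiteSetting(
    name="website 1",
    domain="http://news.example.org/",
    site_type="news",
    encoding="GBK",
    url_filter=r'href="([^"]+\.html)"',
    rules={"title": "rule 1", "author": "rule 2", "time": "rule 3", "text": "rule 4"},
)
# This section keeps every general setting of the website.
news = SectionSetting(name="livelihood", seed_url="http://news.example.org/life/")
# This section lives on its own subdomain and overrides one rule.
edu = SectionSetting(
    name="education",
    seed_url="http://edu.example.org/",
    domain="http://edu.example.org/",
    rules={"author": "section author rule"},
)
print(edu.domain, news.domain)
```

Storing only the differences keeps the section records short and lets a change to a website's general rule reach every section that has not overridden it.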
Step 2, the nodes in the hyperlink acquisition layer read the data source information at regular intervals and acquire the URLs from the specific list pages of the data source; the data source information and the URLs are formatted and stored in the real-time queue together with the corresponding structured extraction rules;
the nodes in the hyperlink acquisition layer read the website and section information set in step 1 at regular intervals and format it with the section as the basic unit; for each collected section the process is as follows:
firstly, inheriting all information of the website to which the current section belongs;
secondly, if the section has its own domain name, URL filtering regular expression or page information extraction rules (covering author, release time, body text and the like) set, the section uses its individual settings to replace the corresponding settings inherited from the website, as shown in Table 3:
Table 3. Section formatting
Thirdly, requesting the seed URL page of the section and extracting the URL set of the target webpages by using the URL filtering regular expression of the current section (one URL set is obtained for each collected section);
fourthly, taking each element of the current URL set as a target URL and completing it with the domain name set for the current section (a target URL may appear in a page in a relative form such as '/abc.html', so it needs to be spliced into the complete form 'http://cu.china.com.cn/abc.html'); the complete target URL is then combined with the corresponding section information obtained in the second step to form a new tuple, for example:
the URL set obtained from the seed URL of section 1 (i.e., S1) is:
SET_1: (link_1, link_2, link_3, ..., link_m)
the tuples stored in the queue are then:
<link_1, section 1, website 1, GBK, 1, rule 2, rule 3, (general) rule 4>
<link_2, section 1, website 1, GBK, 1, rule 2, rule 3, (general) rule 4>
...
<link_m, section 1, website 1, GBK, 1, rule 2, rule 3, (general) rule 4>;
the URL set obtained from the seed URL of section 2 (i.e., S2) is:
SET_2: (link_1, link_2, link_3, ..., link_n)
the tuples stored in the queue are then:
<link_1, section 1, website 1, GBK, 1, (general) rule 2, (general) rule 3, rule 4>
...
<link_n, section 1, website 1, GBK, 1, (general) rule 2, (general) rule 3, rule 4>.
The information in each tuple includes, for the target URL, the page encoding, website name, section name, title extraction rule, author extraction rule, body text extraction rule, release time extraction rule, website type and the like. Finally, the resulting set of tuples is pushed into the real-time queue; in this process each target URL is bound to its extraction rules, so that the two correspond one to one.
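The four sub-steps of step 2 can be sketched as follows. This is an illustration under assumptions rather than the patented implementation: settings are plain dictionaries, the key names, rule strings and seed URL path are hypothetical, the list page is passed in as already-downloaded HTML so that no network access is needed, and the standard-library `urljoin` is used for the URL splicing.

```python
import re
from urllib.parse import urljoin

# Fields copied into every queued tuple (illustrative names).
TUPLE_FIELDS = ("section", "website", "encoding", "site_type",
                "title_rule", "author_rule", "text_rule", "time_rule")


def resolve_section(site, section):
    """Sub-steps one and two: inherit website settings, then apply overrides."""
    merged = dict(site)
    merged.update({k: v for k, v in section.items() if v is not None})
    return merged


def build_tuples(site, section, list_page_html):
    """Sub-steps three and four, given the HTML of the section's seed page."""
    cfg = resolve_section(site, section)
    links = re.findall(cfg["url_filter"], list_page_html)
    tuples = []
    for link in dict.fromkeys(links):  # drop repeats, keep page order
        entry = {"url": urljoin(cfg["domain"], link)}  # '/abc.html' -> full URL
        entry.update({name: cfg.get(name) for name in TUPLE_FIELDS})
        tuples.append(entry)
    return tuples


site = {
    "website": "website 1", "domain": "http://cu.china.com.cn/",
    "encoding": "GBK", "site_type": 1,
    "url_filter": r'href="([^"]+\.html)"',
    "title_rule": "general title rule", "author_rule": "general author rule",
    "text_rule": "general text rule", "time_rule": "general time rule",
}
section = {"section": "section 1", "seed_url": "http://cu.china.com.cn/list/",
           "domain": None, "url_filter": None,
           "author_rule": "section author rule"}
page = '<a href="/abc.html">A</a> <a href="/abc.html">A</a> <a href="/about.php">B</a>'

for t in build_tuples(site, section, page):
    print(t["url"], "|", t["author_rule"], "|", t["title_rule"])
```

Because each tuple carries its own rules, a downloading node needs nothing but the queue entry to process a page, which is what allows the two crawler layers to run on separate machines.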
Step 3, the nodes in the webpage downloading and analyzing layer read the webpage hyperlinks in the real-time queue in real time, request and download the pages, and extract data in a structured manner using the extraction rules transmitted with each hyperlink; if the body text is successfully extracted, the data is stored directly in the webpage data storage layer; otherwise the body text is extracted by a general text extraction method and then stored in the webpage data storage layer.
A node in the webpage downloading and analyzing layer reads the tuples produced in step 2 in real time and requests the target page. For a target page that is requested successfully, the node parses the page using the corresponding extraction rules in the tuple and extracts the information other than the body text; if a rule is empty, or extraction with the current rule fails, the node reports a rule-empty exception or an extraction-failure exception respectively and records a log. For body text extraction, if the configured body text extraction rule is empty or the extraction fails, a general text extraction method is adopted (it serves as a fallback and requires no rule); if that extraction also fails, a general-extraction-failure exception is reported and a log is recorded. If the extraction succeeds, the corresponding data is stored in the data storage layer as a single record.
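The fallback logic of step 3 can be sketched as follows. Several simplifications are assumed and should not be read as the patented method: rules are written as regular expressions with one capturing group, whereas the patent describes extraction based on the HTML document tree; `general_text` is a trivial tag-stripping placeholder and not the density-based method the patent refers to; exceptions are represented by log messages; and all function and key names are hypothetical.

```python
import logging
import re

log = logging.getLogger("parser_node")


def apply_rule(rule, html):
    """Return the first captured group for a regex rule, or None on failure."""
    if not rule:
        return None
    match = re.search(rule, html, re.S)
    return match.group(1).strip() if match else None


def general_text(html):
    """Placeholder fallback: strip tags and collapse whitespace."""
    text = " ".join(re.sub(r"<[^>]+>", " ", html).split())
    return text or None


def extract_record(tup, html):
    """Build one storage record, or return None when no body text is found."""
    record = {"url": tup["url"]}
    for name in ("title", "author", "time"):  # fields other than body text
        rule = tup.get(name + "_rule")
        if not rule:
            log.warning("rule empty: %s (%s)", name, tup["url"])
            record[name] = None
            continue
        record[name] = apply_rule(rule, html)
        if record[name] is None:
            log.warning("extraction failed: %s (%s)", name, tup["url"])
    text = apply_rule(tup.get("text_rule"), html)
    if text is None:  # rule empty or no match: use the rule-free fallback
        text = general_text(html)
    if text is None:
        log.error("general extraction failed (%s)", tup["url"])
        return None
    record["text"] = text
    return record


tup = {"url": "http://example.com/a.html",
       "title_rule": r"<h1>(.*?)</h1>", "author_rule": "",
       "time_rule": r"<time>(.*?)</time>",
       "text_rule": r'<div id="body">(.*?)</div>'}
html = '<h1>Hello</h1><span class="a">Ann</span><div id="c">Body text</div>'
print(extract_record(tup, html))
```

In the usage example the body text rule does not match the page, so the record's text comes from the fallback, while the empty author rule and the failed time rule are only logged.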
Note: the general text extraction method used in the present invention (a density-based text extraction method) has been published in a journal paper: [1] Zhu Zede, Li Miao, Zhang Jian, Chen Lei, Zeng Xinhua. Web text extraction based on a text density model [J]. Pattern Recognition and Artificial Intelligence, 2013, 07: 667-672.
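To give a rough feel for what "text density" means, the sketch below keeps the lines of an HTML document whose share of visible characters is high. It is a deliberately simplified line-based heuristic and is not the model of the cited paper; the function name and both thresholds are assumptions chosen for illustration.

```python
import re

TAG = re.compile(r"<[^>]+>")


def extract_dense_text(html, min_density=0.5, min_chars=20):
    """Keep lines whose visible text is long and dense relative to markup."""
    kept = []
    for line in html.splitlines():
        if not line.strip():
            continue
        text = TAG.sub("", line).strip()
        density = len(text) / len(line)  # visible characters per raw character
        if len(text) >= min_chars and density >= min_density:
            kept.append(text)
    return "\n".join(kept)


sample = (
    '<ul><li><a href="/a">Home</a></li></ul>\n'
    "<p>This paragraph carries the main body of the article.</p>\n"
)
print(extract_dense_text(sample))
```

Navigation and link lists tend to be mostly markup and therefore fall below the thresholds, which is why such a method can work without any per-site rule.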
While the foregoing description shows and describes several preferred embodiments of the invention, it is to be understood, as noted above, that the invention is not limited to the forms disclosed herein. These forms are not to be construed as excluding other embodiments; the invention is capable of use in various other combinations, modifications and environments, and is capable of changes within the scope of the inventive concept expressed herein, through the above teachings or through the skill or knowledge of the relevant art. Modifications and variations made by those skilled in the art without departing from the spirit and scope of the invention fall within the protection of the appended claims.

Claims (7)

1. A distributed network data rapid acquisition method, based on a distributed network data rapid acquisition system, characterized in that the distributed network data rapid acquisition system comprises five layers, namely a seed website setting node, a hyperlink acquisition layer, a real-time queue, a webpage downloading and analyzing layer and a webpage data storage layer;
the seed website setting node is used for setting and storing the parameters and extraction rules of each data source, and is a single node;
the hyperlink acquisition layer is used for requesting the hyperlink list webpages of a data source and extracting the hyperlinks of target webpages;
the real-time queue is used for storing and retrieving the URL hyperlinks extracted by the hyperlink acquisition layer together with their corresponding extraction rules, and for storing the URL hyperlinks that have already been accessed;
the webpage downloading and analyzing layer is used for requesting and parsing the URL hyperlinks in the real-time queue that have not yet been accessed, and for extracting specific data in a formatted manner;
the webpage data storage layer is used for storing the target data extracted in formatted form by the webpage downloading and analyzing layer;
the method comprises the following specific steps:
step 1, the seed website setting node sets all seed URLs, extraction rules and website encoding information;
step 2, the nodes in the hyperlink acquisition layer read the data source information at regular intervals and acquire the URLs from the specific list pages of the data source; the data source information and the URLs are formatted and stored in the real-time queue together with the corresponding structured extraction rules;
step 3, the nodes in the webpage downloading and analyzing layer read the webpage hyperlinks in the real-time queue in real time, request and download the pages, and extract data in a structured manner using the extraction rules transmitted with each hyperlink; if the body text is successfully extracted, the data is stored directly in the webpage data storage layer; otherwise the body text is extracted by a general text extraction method and then stored in the webpage data storage layer;
step 1 specifically comprises: the user adds all target websites to be collected through a web system, each target website containing a plurality of target sections of interest to the user, and the following settings are then made:
firstly, setting the website information, including the domain name, type and page encoding of the current website, a general target URL filtering regular expression and general information extraction rules; the information set for the website applies to all collected sections of the website;
secondly, setting the section information of the website, including the name and seed URL of each section; if the domain name, target URL filtering regular expression or information extraction rules of a section differ from the general settings of the website, they are set individually in the corresponding fields;
step 2 specifically comprises: the nodes in the hyperlink acquisition layer read the website and section information set in step 1 at regular intervals and format it with the section as the basic unit; for each collected section the process is as follows:
firstly, inheriting all information of the website to which the current section belongs;
secondly, if the section has its own domain name, URL filtering regular expression or page information extraction rules set, the section uses its individual settings to replace the corresponding settings inherited from the website;
thirdly, requesting the seed URL page of the section and extracting the URL set of the target webpages by using the URL filtering regular expression of the current section, one URL set being obtained for each section;
fourthly, taking each element of the current URL set as a target URL, completing it with the domain name set for the current section, and combining the complete target URL with the corresponding section information obtained in the second step to form a new tuple, wherein the information in the tuple comprises, for the target URL, the page encoding, website name, section name, title extraction rule, author extraction rule, body text extraction rule, release time extraction rule and website type; finally, the resulting set of tuples is pushed into the real-time queue, and in this process each target URL is bound to its extraction rules so that the two correspond one to one.
2. The distributed network data rapid acquisition method according to claim 1, wherein step 3 specifically comprises: a node in the webpage downloading and analyzing layer reads, in real time, the tuples placed in the real-time queue by step 2 and requests the target page; for a target page that is requested successfully, the node parses the page using the corresponding extraction rules in the tuple and extracts the information other than the body text; if a rule is empty, or extraction with the current rule fails, the node reports a rule-empty exception or an extraction-failure exception respectively and records a log; for body text extraction, if the configured body text extraction rule is empty or the extraction fails, a general text extraction method is adopted, and if that extraction also fails, a general-extraction-failure exception is reported and a log is recorded; if the extraction succeeds, the corresponding data is stored in the data storage layer as a single record.
3. The distributed network data rapid acquisition method according to claim 1, wherein the seed website setting node uses a relational database.
4. The distributed network data rapid acquisition method according to claim 1, wherein the hyperlink acquisition layer is composed of one or more web crawler nodes; the web crawlers are physically isolated from one another, cooperate logically to complete the hyperlink extraction work for the target webpages, and can be scaled horizontally according to the scale of the data sources; the hyperlink acquisition layer is distributed and multi-node, each node runs on a timer, and the node combines the hyperlinks it acquires with their corresponding extraction rules and stores them in the real-time queue.
5. The distributed network data rapid acquisition method according to claim 1, wherein the real-time queue is deployed independently, hyperlinks that have not yet been accessed are stored and retrieved in real time, and hyperlinks that have been accessed are stored persistently.
6. The distributed network data rapid acquisition method according to claim 1, wherein the webpage downloading and analyzing layer is composed of a plurality of web crawler nodes that work independently of one another and are responsible for the structured extraction of the target data, and the layer can be scaled horizontally according to the number of hyperlinks extracted by the hyperlink acquisition layer; each node contains an information structuring extraction method based on the HTML document tree and a general text information extraction method, and can switch between the two when extracting the body text of a webpage;
the webpage downloading and analyzing layer is distributed and multi-node and can be scaled horizontally; each node runs in real time, reading hyperlinks from the real-time queue, then filtering, parsing and formatting the target data and storing it.
7. The distributed network data rapid acquisition method according to claim 1, wherein the webpage data storage layer adopts an open-source big data storage database with multi-node storage, and the number of nodes is expandable.
CN201610864062.6A 2016-09-29 2016-09-29 Distributed internet data rapid acquisition system and acquisition method Active CN106484828B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610864062.6A CN106484828B (en) 2016-09-29 2016-09-29 Distributed internet data rapid acquisition system and acquisition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610864062.6A CN106484828B (en) 2016-09-29 2016-09-29 Distributed internet data rapid acquisition system and acquisition method

Publications (2)

Publication Number Publication Date
CN106484828A CN106484828A (en) 2017-03-08
CN106484828B true CN106484828B (en) 2020-01-21

Family

ID=58268931

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610864062.6A Active CN106484828B (en) 2016-09-29 2016-09-29 Distributed internet data rapid acquisition system and acquisition method

Country Status (1)

Country Link
CN (1) CN106484828B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108259459A (en) * 2017-11-16 2018-07-06 南方电网科学研究院有限责任公司 Internet data acquisition grasping system
CN108268433B (en) * 2018-02-26 2019-06-11 杭州数梦工场科技有限公司 Title abstracting method and device based on webpage article
CN108573155B (en) * 2018-04-18 2020-10-16 北京知道创宇信息技术股份有限公司 Method and device for detecting vulnerability influence range, electronic equipment and storage medium
CN109446441B (en) * 2018-09-26 2020-11-03 北京邮电大学 General credible distributed acquisition and storage system for network community
CN111258969B (en) * 2018-11-30 2023-08-15 中国移动通信集团浙江有限公司 Internet access log analysis method and device
CN109947751B (en) * 2018-12-29 2023-04-07 医渡云(北京)技术有限公司 Medical data processing method and device, readable medium and electronic equipment
CN109815382B (en) * 2018-12-29 2022-07-12 中国科学院计算技术研究所 Method and system for sensing and acquiring large-scale network data
CN109840298B (en) * 2018-12-29 2021-09-24 中国科学院计算技术研究所 Multi-information-source acquisition method and system for large-scale network data
CN110262904B (en) * 2019-05-17 2022-10-14 北京达佳互联信息技术有限公司 Data acquisition method and device
CN111078975B (en) * 2019-12-23 2023-04-28 北京天元创新科技有限公司 Multi-node incremental data acquisition system and acquisition method
CN111680203B (en) * 2020-05-07 2023-04-18 支付宝(杭州)信息技术有限公司 Data acquisition method and device and electronic equipment
CN112464065A (en) * 2020-06-06 2021-03-09 谢国柱 Big data acquisition method and system based on mobile internet
CN112100495B (en) * 2020-09-14 2024-04-16 山东亿云信息技术有限公司 Distributed-based one-stop acquisition method and acquisition system
CN112287254B (en) * 2020-11-23 2023-10-27 武汉虹旭信息技术有限责任公司 Webpage structured information extraction method and device, electronic equipment and storage medium
CN116070052A (en) * 2023-01-28 2023-05-05 爱集微咨询(厦门)有限公司 Interface data transmission method, device, terminal and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7502773B1 (en) * 2003-12-31 2009-03-10 Microsoft Corporation System and method facilitating page indexing employing reference information
CN102646129A (en) * 2012-03-09 2012-08-22 武汉大学 Topic-relative distributed web crawler system
CN105515815A (en) * 2014-10-17 2016-04-20 任子行网络技术股份有限公司 Heritrix-based distributed collection method and system
CN105956068A (en) * 2016-04-27 2016-09-21 湖南蚁坊软件有限公司 Webpage URL repetition elimination method based on distributed database

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7672943B2 (en) * 2006-10-26 2010-03-02 Microsoft Corporation Calculating a downloading priority for the uniform resource locator in response to the domain density score, the anchor text score, the URL string score, the category need score, and the link proximity score for targeted web crawling

Also Published As

Publication number Publication date
CN106484828A (en) 2017-03-08

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230810

Address after: 621000 1st floor, building 4, innovation center, science and innovation District, Mianyang City, Sichuan Province

Patentee after: Sichuan Zhongke rongchuang Technology Co.,Ltd.

Address before: 621010 No. 59, Qinglong Avenue, Mianyang City, Sichuan Province

Patentee before: Southwest University of Science and Technology