CN106484828A - Distributed internet data rapid acquisition system and acquisition method - Google Patents

Publication number
CN106484828A
Authority
CN
China
Prior art keywords
data
hyperlink
url
real
column
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610864062.6A
Other languages
Chinese (zh)
Other versions
CN106484828B (en)
Inventor
张晖
杨春明
李晓伟
李波
赵旭剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Zhongke Rongchuang Technology Co ltd
Original Assignee
Southwest University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-09-29
Filing date
2016-09-29
Publication date
2017-03-08
Application filed by Southwest University of Science and Technology
Priority to CN201610864062.6A
Publication of CN106484828A
Application granted
Publication of CN106484828B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention discloses a distributed internet data rapid acquisition system comprising five layers: a seed website configuration node, a hyperlink collection layer, a real-time queue, a page download and parsing layer, and a web page data storage layer. The seed website configuration node sets and stores the parameters and extraction rules of each data source. The hyperlink collection layer requests the hyperlink list pages of the data sources and extracts the hyperlinks of the target pages. The real-time queue stores and serves the URL hyperlinks extracted by the hyperlink collection layer together with their corresponding extraction rules, as well as the URL hyperlinks that have already been visited. The page download and parsing layer requests and parses the unvisited URL hyperlinks in the real-time queue and extracts specific data in a formatted manner. The web page data storage layer stores the target data extracted by the page download and parsing layer. The invention collects data through distributed, layered cooperation and can meet the needs of systems with large collection volumes, many data sources, and high real-time requirements.

Description

Distributed internet data rapid acquisition system and acquisition method
Technical field
The invention belongs to the technical field of internet big data acquisition, and in particular relates to a distributed internet data rapid acquisition system and acquisition method.
Background art
The rapid development of the internet has brought society into an information age in which data is highly developed and open, and the era of big data has arrived. Data plays an extremely important role in enterprise operations, government decision-making, and the analysis of social trends, so large-scale, rapid data collection has become a technical focus; existing collection methods, however, leave room for improvement. Traditional internet data collection relies mainly on web crawlers and targets structured or semi-structured text data. A web crawler is a program or script that automatically traverses and fetches web pages according to certain rules. Text data is mostly embedded in the program code of the page. The timeliness of data collection directly determines how long the data remains useful, so rapid data collection is a top priority.
Summary of the invention
In view of this, and addressing the problems of large collection volumes, many data sources, and poor real-time performance, the invention provides a distributed internet data rapid acquisition system and acquisition method.
To solve the above technical problem, the invention discloses a distributed internet data rapid acquisition system comprising five layers: a seed website configuration node, a hyperlink collection layer, a real-time queue, a page download and parsing layer, and a web page data storage layer;
The seed website configuration node sets and stores the parameters, extraction rules, and similar settings of each data source; it is a single node;
The hyperlink collection layer requests the hyperlink list pages of the data sources and extracts the hyperlinks of the target pages;
The real-time queue stores and serves the URL hyperlinks extracted by the hyperlink collection layer, the extraction rules corresponding to each hyperlink, and the URL hyperlinks that have already been visited;
The page download and parsing layer requests and parses the unvisited URL hyperlinks in the real-time queue and extracts specific data in a formatted manner;
The web page data storage layer stores the target data extracted by the page download and parsing layer.
Compared with the prior art, the invention can achieve the following technical effects:
1) The acquisition system runs in a waterfall (pipeline) fashion, offers high real-time performance and strong scalability, and copes well with systems that have many data sources and large collection volumes.
2) The invention collects data through distributed, layered cooperation, so it can meet system requirements of many data sources, large collection volumes, and high real-time performance, while also being highly scalable and customizable. Data extraction includes two schemes, precise structured extraction and generic body-text extraction (for the body text only), so the extracted data is more complete.
Of course, a product implementing the invention need not achieve all of the above technical effects at the same time.
Brief description of the drawings
The drawings described here provide a further understanding of the invention and form part of it; the illustrative embodiments and their descriptions explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a structural diagram of the distributed internet data rapid acquisition system of the invention;
Fig. 2 is a flow chart of the distributed internet data rapid acquisition method of the invention.
Detailed description
Embodiments of the invention are described in detail below with reference to examples, so that how the invention applies technical means to solve the technical problem and achieve the technical effect can be fully understood and implemented.
The distributed internet data rapid acquisition system of the invention, as shown in Fig. 1, comprises five layers: a seed website configuration node, a hyperlink collection layer, a real-time queue, a page download and parsing layer, and a web page data storage layer.
The seed website configuration node sets and stores the parameters, extraction rules, and similar settings of each data source; it is a single node and uses a relational database.
The hyperlink collection layer requests the hyperlink list pages of the data sources and extracts the hyperlinks of the target pages. It consists of one or more web crawler nodes that are physically isolated from one another and cooperate logically to extract the hyperlinks of the target pages; the layer can be scaled horizontally according to the number of data sources. Each node runs on a timer, and the nodes of this layer combine each collected hyperlink with its corresponding extraction rules and store the result in the real-time queue.
The real-time queue stores and serves the URL hyperlinks extracted by the hyperlink collection layer, the extraction rules corresponding to each hyperlink, and the URL hyperlinks that have already been visited. It is the core component through which the crawler nodes cooperate. This layer has very high real-time performance: data can be put in and taken out in real time, it supports persistent storage, and it also filters the collected URL hyperlinks. The real-time queue is deployed independently; unvisited hyperlinks are accessed in real time, and visited hyperlinks are stored persistently.
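The behaviour described for the real-time queue (serve unvisited links, remember links already seen, filter duplicates) can be sketched roughly as follows. This is a minimal in-memory illustration only: the class name `RealTimeQueue` and its `push`/`pop` methods are hypothetical, the `_seen` set merely stands in for the persistent store the text mentions, and the path `a.html` is made up for the example.

```python
from collections import deque


class RealTimeQueue:
    """In-memory stand-in for the real-time queue layer (hypothetical API)."""

    def __init__(self):
        self._pending = deque()  # unvisited (url, rules) entries, first in, first out
        self._seen = set()       # every URL accepted so far; stands in for persistent storage

    def push(self, url, rules):
        """Accept a URL with its extraction rules; return False for a duplicate."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append((url, rules))
        return True

    def pop(self):
        """Return the next unvisited (url, rules) pair, or None if nothing is pending."""
        return self._pending.popleft() if self._pending else None


queue = RealTimeQueue()
queue.push("http://edu.newssc.org/a.html", {"title": "rule 1"})
queue.push("http://edu.newssc.org/a.html", {"title": "rule 1"})  # filtered as a duplicate
```

A real deployment would presumably replace both containers with an independently deployed queue service, since the text requires persistence and access from several crawler nodes.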
The page download and parsing layer requests and parses the unvisited URL hyperlinks in the real-time queue and extracts specific data in a formatted manner. Like the hyperlink collection layer, it consists of multiple web crawler nodes that work independently of one another and are mainly responsible for structured extraction of the target data; the layer can be scaled horizontally according to the number of hyperlinks extracted by the hyperlink collection layer. Each node includes a structured information extraction method based on the HTML document tree and a generic text extraction method, which can be used interchangeably to extract the body text of a page. Each node runs in real time: after reading a hyperlink from the real-time queue, it filters and parses the page, extracts the target data in a formatted manner, and stores it.
The web page data storage layer stores the target data extracted by the page download and parsing layer. It is implemented with an open-source big data storage database and stores data across multiple nodes; it has a large capacity for web document data, good read performance, and a dynamically expandable number of nodes.
A distributed network data rapid acquisition method, based on the distributed network data rapid acquisition system described above, comprises the following steps, as shown in Fig. 2:
Step 1: the seed website configuration node sets all seed URLs, extraction rules, website encodings, and similar information;
Through a web system, the user adds all target websites to be collected. These websites contain multiple target columns (site sections) of interest to the user, such as education or people's livelihood. They are then configured as follows:
First, set the site information, which includes the site's domain name, name, type, page encoding, site-wide regular expression for filtering target URLs, and site-wide information extraction rules (covering author, publication time, body text, and so on). The settings of a site apply to all collected columns under that site, as shown in Table 1:
Table 1. Website settings
Second, set the column information of the site, including the column's name and seed URL. If a column's domain name, target URL filter regular expression, or information extraction rules differ from the site-wide settings, set them individually in the corresponding fields. For example, the domain name set for the site may be http://newssc.org/ while the domain name of a column is http://edu.newssc.org/. Settings of a column that are identical to the site-wide settings need not be repeated. Table 2 shows the columns configured under "Website 1" of Table 1:
Table 2. Column settings for "Website 1" of Table 1
Step 2: the nodes in the hyperlink collection layer periodically read the data source information, collect the URLs on the specified list pages of each data source, format them, and store them in the real-time queue together with the corresponding structured extraction rules;
The nodes in the hyperlink collection layer periodically read the site and column information set in Step 1 and process it with the column as the basic unit. For each collected column, the process is as follows:
First, inherit all settings of the site to which the column belongs;
Second, if the column sets its own domain name, URL filter regular expression, or page information extraction rules (covering author, publication time, body text, and so on), these column-specific settings replace the corresponding settings inherited from the site, as shown in Table 3:
Table 3. Column formatting
Third, request the column's seed URL page and use the column's URL filter regular expression to extract the set of target page URLs (one URL set per column);
Fourth, take each element of the URL set as a target URL and use the column's domain name to build the complete target URL (a target URL may appear in the page as '/abc.html' or './abc.html' and must be joined into the complete form 'http://cul.china.com.cn/abc.html'). Combine it with the column information obtained in the second step to form a new tuple, for example:
The URL set collected from the seed URL of column 1 (S1):
SET_1: (link_1, link_2, link_3, ..., link_m)
The tuples stored in the queue are then:
<link_1, column 1, website 1, GBK, 1, rule 1, rule 2, rule 3, (general) rule 4>
<link_2, column 1, website 1, GBK, 1, rule 1, rule 2, rule 3, (general) rule 4>
...
<link_m, column 1, website 1, GBK, 1, rule 1, rule 2, rule 3, (general) rule 4>;
The URL set collected from the seed URL of column 2 (S2):
SET_2: (link_1, link_2, link_3, ..., link_n)
The tuples stored in the queue are then:
<link_1, column 2, website 1, GBK, 1, (general) rule 1, (general) rule 2, (general) rule 3, rule 4>
...
<link_n, column 2, website 1, GBK, 1, (general) rule 1, (general) rule 2, (general) rule 3, rule 4>.
The information in a tuple includes the target URL's page encoding, site name, column name, title extraction rule, author extraction rule, body-text extraction rule, publication time extraction rule, site type, and so on. Finally, the assembled tuples are pushed into the real-time queue. This process binds each target URL to its extraction rules in a one-to-one correspondence.
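The four sub-steps of Step 2 (inherit site settings, apply column overrides, filter links with the regular expression, join them into complete URLs and bind them to rules) could be sketched as below. This is an illustration under stated assumptions rather than the patented implementation: the names `LinkTask`, `resolve`, and `build_tasks` are hypothetical, settings are modelled as plain dictionaries where `None` means "not set for this column", the URL filter is assumed to match link text directly in the page source, and the site domain `http://china.com.cn/`, the sample regular expression, and the sample HTML are invented for the example. Only `re.finditer` and `urllib.parse.urljoin` from the Python standard library are used.

```python
import re
from typing import NamedTuple
from urllib.parse import urljoin


class LinkTask(NamedTuple):
    """One queue entry: a target URL bound to the settings needed to parse it."""
    url: str
    column: str
    site: str
    encoding: str
    site_type: str
    title_rule: str
    author_rule: str
    body_rule: str
    time_rule: str


def resolve(site: dict, column: dict) -> dict:
    """Inherit every site setting, then let the column's own settings override."""
    merged = dict(site)
    merged.update({k: v for k, v in column.items() if v is not None})
    return merged


def build_tasks(list_page_html: str, site: dict, column: dict) -> list:
    cfg = resolve(site, column)
    tasks, seen = [], set()
    for match in re.finditer(cfg["url_filter"], list_page_html):
        # urljoin turns '/abc.html' or './abc.html' into a complete URL.
        url = urljoin(cfg["domain"], match.group(0))
        if url in seen:  # each column yields a set of URLs, so skip repeats
            continue
        seen.add(url)
        tasks.append(LinkTask(url, cfg["column"], cfg["site"], cfg["encoding"],
                              cfg["site_type"], cfg["title_rule"], cfg["author_rule"],
                              cfg["body_rule"], cfg["time_rule"]))
    return tasks


site = {
    "site": "website 1", "domain": "http://china.com.cn/", "encoding": "GBK",
    "site_type": "1", "url_filter": r"\.?/\w+\.html",
    "title_rule": "rule 1", "author_rule": "rule 2",
    "body_rule": "rule 3", "time_rule": "general rule 4",
}
# The column overrides the domain and leaves the time rule to the site default.
column = {"column": "column 1", "domain": "http://cul.china.com.cn/", "time_rule": None}
html = '<a href="/abc.html">A</a> <a href="./def.html">B</a> <a href="/about.htm">C</a>'
tasks = build_tasks(html, site, column)
for task in tasks:
    print(task.url)
```

Binding the rules to each URL at collection time means a parsing node needs nothing but the queue entry to do its work, which seems to be the point of the one-to-one correspondence described above.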
Step 3: the nodes in the page download and parsing layer read hyperlinks from the real-time queue in real time, request and download the pages, and extract data in a structured manner using the extraction rules passed along with each link. If the body text is extracted successfully, it is stored directly in the web page data storage layer; otherwise the generic body-text extraction method is used, and the result is then stored in the web page data storage layer.
The nodes in the page download and parsing layer read the tuples assembled in Step 2 from the real-time queue in real time and request the target page. A page that is requested successfully is parsed with the corresponding extraction rules in the tuple. For information other than the body text, if a rule is empty or extraction with the current rule fails, an empty-rule exception or an extraction-failure exception is raised and logged. For body-text extraction, if the configured body-text rule is empty or extraction fails, the generic body-text extraction method is used (as a fallback that does not depend on configured rules); if that also fails, a generic-extraction-failure exception is raised and logged. If extraction succeeds, the corresponding data is stored in the data storage layer as a single record.
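The rule-then-fallback flow of Step 3 might look like the sketch below. Several things here are assumptions made only for illustration: each extraction rule is modelled as a regular expression with exactly one capturing group (the source does not specify a rule format), the function names are hypothetical, a record is kept even when a non-body field fails, and `generic_body` is a deliberately naive stand-in (longest `<p>` block) that is not the density-based method the patent refers to.

```python
import logging
import re

log = logging.getLogger("parser_node")


def apply_rule(rule, html):
    """Apply one rule; a rule is assumed to be a regex with one capturing group."""
    if not rule:
        raise ValueError("rule is empty")
    match = re.search(rule, html, re.S)
    if match is None:
        raise LookupError("rule matched nothing")
    return match.group(1).strip()


def generic_body(html):
    """Naive placeholder for the generic body extractor: the longest <p> block."""
    paragraphs = re.findall(r"<p[^>]*>(.*?)</p>", html, re.S)
    text = re.sub(r"<[^>]+>", "", max(paragraphs, key=len, default="")).strip()
    if not text:
        raise LookupError("generic extraction failed")
    return text


def extract_record(url, html, rules):
    """Return one record for the storage layer, or None if no body text was found."""
    record = {"url": url}
    for field in ("title", "author", "time"):
        try:
            record[field] = apply_rule(rules.get(field), html)
        except (ValueError, LookupError) as exc:
            log.warning("%s: %s rule failed: %s", url, field, exc)
            record[field] = None
    try:
        record["body"] = apply_rule(rules.get("body"), html)
    except (ValueError, LookupError):
        try:
            record["body"] = generic_body(html)  # fallback, independent of configured rules
        except LookupError as exc:
            log.error("%s: %s", url, exc)
            return None
    return record


page = "<h1>Title</h1><p>short</p><p>a much longer paragraph</p>"
record = extract_record("http://cul.china.com.cn/abc.html", page,
                        {"title": r"<h1>(.*?)</h1>", "body": ""})
```

In this example the body rule is empty, so the title comes from the configured rule while the body comes from the fallback; the missing author and time rules are only logged.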
Note: the generic body-text extraction method used in the invention (a density-based body-text extraction method) is from a published paper; see: [1] Zhu Zede, Li Miao, Zhang Jian, Chen Lei, Zeng Xinhua. Web text extraction based on a text density model [J]. Pattern Recognition and Artificial Intelligence, 2013, 07: 667-672.
The above description shows and describes several preferred embodiments of the invention. As stated above, however, the invention is not limited to the forms disclosed here; they should not be regarded as excluding other embodiments, and the invention can be used in various other combinations, modifications, and environments and can be changed within the scope of the inventive concept described here through the above teachings or the skill or knowledge of the related art. Changes and modifications made by those skilled in the art that do not depart from the spirit and scope of the invention shall fall within the protection scope of the appended claims.

Claims (10)

1. A distributed internet data rapid acquisition system, characterized in that it comprises five layers: a seed website configuration node, a hyperlink collection layer, a real-time queue, a page download and parsing layer, and a web page data storage layer;
the seed website configuration node sets and stores the parameters, extraction rules, and similar settings of each data source, and is a single node;
the hyperlink collection layer requests the hyperlink list pages of the data sources and extracts the hyperlinks of the target pages;
the real-time queue stores and serves the URL hyperlinks extracted by the hyperlink collection layer, the extraction rules corresponding to each hyperlink, and the URL hyperlinks that have already been visited;
the page download and parsing layer requests and parses the unvisited URL hyperlinks in the real-time queue and extracts specific data in a formatted manner;
the web page data storage layer stores the target data extracted by the page download and parsing layer.
2. The distributed internet data rapid acquisition system according to claim 1, characterized in that the seed website configuration node uses a relational database.
3. The distributed internet data rapid acquisition system according to claim 1, characterized in that the hyperlink collection layer consists of one or more web crawler nodes that are physically isolated from one another and cooperate logically to extract the hyperlinks of the target pages, and can be scaled horizontally according to the number of data sources; the hyperlink collection layer is a distributed multi-node layer, each node runs on a timer, and the nodes of this layer combine each collected hyperlink with its corresponding extraction rules and store the result in the real-time queue.
4. The distributed internet data rapid acquisition system according to claim 1, characterized in that the real-time queue is deployed independently, unvisited hyperlinks are accessed in real time, and visited hyperlinks are stored persistently.
5. The distributed internet data rapid acquisition system according to claim 1, characterized in that the page download and parsing layer consists of multiple web crawler nodes that work independently of one another and are mainly responsible for structured extraction of the target data, and can be scaled horizontally according to the number of hyperlinks extracted by the hyperlink collection layer; each node includes a structured information extraction method based on the HTML document tree and a generic text extraction method, which can be used interchangeably to extract the body text of a page;
the page download and parsing layer is a distributed multi-node layer that can be scaled horizontally; each node runs in real time, and after reading a hyperlink from the real-time queue it filters and parses the page, extracts the target data in a formatted manner, and stores it.
6. The distributed internet data rapid acquisition system according to claim 1, characterized in that the web page data storage layer uses an open-source big data storage database with an expandable number of nodes.
7. A distributed network data rapid acquisition method, based on the distributed network data rapid acquisition system according to any one of claims 1 to 6, characterized in that it comprises the following steps:
Step 1: the seed website configuration node sets all seed URLs, extraction rules, website encodings, and similar information;
Step 2: the nodes in the hyperlink collection layer periodically read the data source information, collect the URLs on the specified list pages of each data source, format them, and store them in the real-time queue together with the corresponding structured extraction rules;
Step 3: the nodes in the page download and parsing layer read hyperlinks from the real-time queue in real time, request and download the pages, and extract data in a structured manner using the extraction rules passed along with each link; if the body text is extracted successfully, it is stored directly in the web page data storage layer; otherwise the generic body-text extraction method is used, and the result is then stored in the web page data storage layer.
8. The distributed network data rapid acquisition method according to claim 7, characterized in that Step 1 is specifically: through a web system, the user adds all target websites to be collected, these websites containing multiple target columns of interest to the user, which are then configured as follows:
First, set the site information, which includes the site's domain name, name, type, page encoding, site-wide regular expression for filtering target URLs, and site-wide information extraction rules; the settings of a site apply to all collected columns under that site;
Second, set the column information of the site, including the column's name and seed URL; if a column's domain name, target URL filter regular expression, or information extraction rules differ from the site-wide settings, set them individually in the corresponding fields.
9. The distributed network data rapid acquisition method according to claim 8, characterized in that Step 2 is specifically: the nodes in the hyperlink collection layer periodically read the site and column information set in Step 1 and process it with the column as the basic unit; for each collected column, the process is as follows:
First, inherit all settings of the site to which the column belongs;
Second, if the column sets its own domain name, URL filter regular expression, or page information extraction rules, these column-specific settings replace the corresponding settings inherited from the site;
Third, request the column's seed URL page and use the column's URL filter regular expression to extract the set of target page URLs, one URL set per column;
Fourth, take each element of the URL set as a target URL, use the column's domain name to build the complete target URL, and combine it with the column information obtained in the second step to form a new tuple; the information in a tuple includes the target URL's page encoding, site name, column name, title extraction rule, author extraction rule, body-text extraction rule, publication time extraction rule, site type, and so on; finally, the assembled tuples are pushed into the real-time queue; this process binds each target URL to its extraction rules in a one-to-one correspondence.
10. The distributed network data rapid acquisition method according to claim 9, characterized in that Step 3 is specifically: the nodes in the page download and parsing layer read the tuples assembled in Step 2 from the real-time queue in real time and request the target page; a page that is requested successfully is parsed with the corresponding extraction rules in the tuple; for information other than the body text, if a rule is empty or extraction with the current rule fails, an empty-rule exception or an extraction-failure exception is raised and logged; for body-text extraction, if the configured body-text rule is empty or extraction fails, the generic body-text extraction method is used, and if that also fails, a generic-extraction-failure exception is raised and logged; if extraction succeeds, the corresponding data is stored in the data storage layer as a single record.
CN201610864062.6A 2016-09-29 2016-09-29 Distributed internet data rapid acquisition system and acquisition method Active CN106484828B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610864062.6A CN106484828B (en) 2016-09-29 2016-09-29 Distributed internet data rapid acquisition system and acquisition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610864062.6A CN106484828B (en) 2016-09-29 2016-09-29 Distributed internet data rapid acquisition system and acquisition method

Publications (2)

Publication Number Publication Date
CN106484828A true CN106484828A (en) 2017-03-08
CN106484828B CN106484828B (en) 2020-01-21

Family

ID=58268931

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610864062.6A Active CN106484828B (en) 2016-09-29 2016-09-29 Distributed internet data rapid acquisition system and acquisition method

Country Status (1)

Country Link
CN (1) CN106484828B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080104113A1 (en) * 2006-10-26 2008-05-01 Microsoft Corporation Uniform resource locator scoring for targeted web crawling
US7502773B1 (en) * 2003-12-31 2009-03-10 Microsoft Corporation System and method facilitating page indexing employing reference information
CN102646129A (en) * 2012-03-09 2012-08-22 武汉大学 Topic-relative distributed web crawler system
CN105515815A (en) * 2014-10-17 2016-04-20 任子行网络技术股份有限公司 Heritrix-based distributed collection method and system
CN105956068A (en) * 2016-04-27 2016-09-21 湖南蚁坊软件有限公司 Webpage URL repetition elimination method based on distributed database

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108259459A (en) * 2017-11-16 2018-07-06 南方电网科学研究院有限责任公司 A kind of internet data acquires grasping system
CN108268433A (en) * 2018-02-26 2018-07-10 杭州数梦工场科技有限公司 Title abstracting method and device based on webpage article
CN108268433B (en) * 2018-02-26 2019-06-11 杭州数梦工场科技有限公司 Title abstracting method and device based on webpage article
CN108573155B (en) * 2018-04-18 2020-10-16 北京知道创宇信息技术股份有限公司 Method and device for detecting vulnerability influence range, electronic equipment and storage medium
CN108573155A (en) * 2018-04-18 2018-09-25 北京知道创宇信息技术有限公司 Detect method, apparatus, electronic equipment and the storage medium of loophole coverage
CN109446441A (en) * 2018-09-26 2019-03-08 北京邮电大学 A kind of credible distributed capture storage system of general Web Community
CN109446441B (en) * 2018-09-26 2020-11-03 北京邮电大学 General credible distributed acquisition and storage system for network community
CN111258969B (en) * 2018-11-30 2023-08-15 中国移动通信集团浙江有限公司 Internet access log analysis method and device
CN111258969A (en) * 2018-11-30 2020-06-09 中国移动通信集团浙江有限公司 Internet access log analysis method and device
CN109815382A (en) * 2018-12-29 2019-05-28 中国科学院计算技术研究所 The perception and acquisition methods and system of large scale network data
CN109840298A (en) * 2018-12-29 2019-06-04 中国科学院计算技术研究所 The multi information source acquisition method and system of large scale network data
CN109947751A (en) * 2018-12-29 2019-06-28 医渡云(北京)技术有限公司 A kind of medical data processing method, device, readable medium and electronic equipment
CN110262904B (en) * 2019-05-17 2022-10-14 北京达佳互联信息技术有限公司 Data acquisition method and device
CN110262904A (en) * 2019-05-17 2019-09-20 北京达佳互联信息技术有限公司 Collecting method and device
CN111078975A (en) * 2019-12-23 2020-04-28 北京天元创新科技有限公司 Multi-node incremental data acquisition system and acquisition method
CN111078975B (en) * 2019-12-23 2023-04-28 北京天元创新科技有限公司 Multi-node incremental data acquisition system and acquisition method
CN111680203A (en) * 2020-05-07 2020-09-18 支付宝(杭州)信息技术有限公司 Data acquisition method and device and electronic equipment
CN111680203B (en) * 2020-05-07 2023-04-18 支付宝(杭州)信息技术有限公司 Data acquisition method and device and electronic equipment
CN111708931A (en) * 2020-06-06 2020-09-25 谢国柱 Big data acquisition method based on mobile internet and artificial intelligence cloud service platform
CN111708931B (en) * 2020-06-06 2020-12-25 湖南伟业动物营养集团股份有限公司 Big data acquisition method based on mobile internet and artificial intelligence cloud service platform
CN112100495A (en) * 2020-09-14 2020-12-18 山东亿云信息技术有限公司 Distributed one-stop acquisition method and acquisition system
CN112100495B (en) * 2020-09-14 2024-04-16 山东亿云信息技术有限公司 Distributed-based one-stop acquisition method and acquisition system
CN112287254A (en) * 2020-11-23 2021-01-29 武汉虹旭信息技术有限责任公司 Webpage structured information extraction method and device, electronic equipment and storage medium
CN112287254B (en) * 2020-11-23 2023-10-27 武汉虹旭信息技术有限责任公司 Webpage structured information extraction method and device, electronic equipment and storage medium
CN116070052A (en) * 2023-01-28 2023-05-05 爱集微咨询(厦门)有限公司 Interface data transmission method, device, terminal and storage medium

Also Published As

Publication number Publication date
CN106484828B (en) 2020-01-21

Similar Documents

Publication Publication Date Title
CN106484828A (en) A kind of distributed interconnection data Fast Acquisition System and acquisition method
KR101527259B1 (en) Providing posts to discussion threads in response to a search query
US9940391B2 (en) System, method and computer readable medium for web crawling
CN103399861B (en) A kind of network address in Web side navigation recommends methods, devices and systems
CN102663062A (en) Method and device for processing invalid links in search result
CN104182482B (en) A kind of news list page determination methods and the method for screening news list page
CN103279567A (en) Web data collection method and system both based on AJAX (asynchronous javascript and extensible markup language)
KR102222287B1 (en) Web Crawler System for Collecting a Structured and Unstructured Data in Hidden URL
CN103744856A (en) Method, device and system for linkage extended search
CN102222098A (en) Method and system for pre-fetching webpage
CN105302876A (en) Regular expression based URL filtering method
CN102880679B (en) A kind of info web storage means and device
CN103123640A (en) Method and device for searching novel
CN103605742B (en) Recognize the method and device of Internet resources entity catalogue page
Thelwall A publicly accessible database of UK university website links and a discussion of the need for human intervention in web crawling
JP4253315B2 (en) Knowledge information collecting system and knowledge information collecting method
CN108763583A (en) A kind of microblog hot topic extracting method and system based on keyword search
Matsunaga et al. A web syllabus crawler and its efficiency evaluation
Liu et al. User Browsing Graph: Structure, Evolution and Application.
Huang et al. TREC 2018 News Track.
JP4831728B2 (en) Marketing system using web bookmarks
JP6510452B2 (en) Search server, search system, search information distribution system, search program, search information distribution program
Madaan et al. A novel architecture for a blog crawler
JP3725087B2 (en) Knowledge information collecting system and knowledge information collecting method
CN102103622A (en) Web site rebuilding system and method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230810
Address after: 621000 1st floor, building 4, innovation center, science and innovation District, Mianyang City, Sichuan Province
Patentee after: Sichuan Zhongke Rongchuang Technology Co., Ltd.
Address before: 621010 No. 59, Qinglong Avenue, Mianyang City, Sichuan Province
Patentee before: Southwest University of Science and Technology