CN106484828A - Distributed internet data rapid acquisition system and acquisition method - Google Patents
Distributed internet data rapid acquisition system and acquisition method
- Publication number
- CN106484828A (application CN201610864062.6A)
- Authority
- CN
- China
- Prior art keywords
- data
- hyperlink
- url
- real
- column
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Abstract
The invention discloses a distributed internet data rapid acquisition system with five layers: a seed website configuration node, a hyperlink collection layer, a real-time queue, a page download and parsing layer, and a web page data storage layer. The seed website configuration node sets and stores the parameters and extraction rules of each data source. The hyperlink collection layer requests the link-list pages of a data source and extracts the hyperlinks of target pages. The real-time queue stores and serves the URLs extracted by the hyperlink collection layer together with their corresponding extraction rules, and also keeps the URLs that have already been visited. The page download and parsing layer requests and parses the unvisited URLs in the real-time queue and extracts the specified data in a formatted way. The web page data storage layer stores the target data extracted by the page download and parsing layer. By collecting data through cooperating distributed layers, the invention can meet the needs of systems with large collection volumes, many data sources, and strict real-time requirements.
Description
Technical field
The invention belongs to the field of internet big data acquisition technology, and in particular relates to a distributed internet data rapid acquisition system and acquisition method.
Background
The rapid development of the internet has brought society into an information age in which data is highly developed and open, and the era of big data has arrived. Data plays an extremely important role in enterprise operations, government decision-making, and the analysis of social trends, so large-scale, rapid data collection has become a technical focus; however, existing data collection methods leave room for improvement. Traditional internet data collection relies mainly on web crawlers and targets structured or semi-structured text data. A web crawler is a program or script that automatically traverses and fetches text pages on the internet according to certain rules. Most of the text data is embedded in the program code of web pages. The real-time performance of data collection directly determines the timeliness of the data, so rapid data collection is of the utmost importance.
Summary of the invention
In view of this, the present invention addresses the problems of large collection volume, many data sources, and poor real-time performance by providing a distributed internet data rapid acquisition system and acquisition method.
To solve the above technical problems, the invention discloses a distributed internet data rapid acquisition system comprising five layers: a seed website configuration node, a hyperlink collection layer, a real-time queue, a page download and parsing layer, and a web page data storage layer;
The seed website configuration node sets and stores the parameters, extraction rules, and other settings of each data source, and is a single node;
The hyperlink collection layer requests the link-list pages of a data source and extracts the hyperlinks of target pages;
The real-time queue stores and serves the URLs extracted by the hyperlink collection layer, the extraction rules corresponding to each URL, and the URLs that have already been visited;
The page download and parsing layer requests and parses the unvisited URLs in the real-time queue and extracts the specified data in a formatted way;
The web page data storage layer stores the target data extracted by the page download and parsing layer.
Compared with the prior art, the present invention can achieve the following technical effects:
1) The acquisition system runs in waterfall mode, with each layer feeding the next, which gives it high real-time performance and strong scalability; it copes well with systems that have many data sources and large collection volumes.
2) The invention collects data through cooperating distributed layers, so it can meet system requirements of many data sources, large collection volume, and high real-time performance, while also being highly scalable and customizable. Data extraction combines two schemes, precise structured extraction and general main-text extraction (for the body part only), so the extracted data is more complete.
Of course, a product implementing the present invention does not necessarily need to achieve all of the above technical effects at the same time.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the present invention and form a part of it. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a structural diagram of the distributed internet data rapid acquisition system of the present invention;
Fig. 2 is a flow chart of the distributed internet data rapid acquisition method of the present invention.
Detailed description
Embodiments of the present invention are described in detail below with reference to examples, so that the way the invention applies technical means to solve technical problems and achieve technical effects can be fully understood and implemented.
As shown in Fig. 1, the distributed internet data rapid acquisition system of the present invention comprises five layers: a seed website configuration node, a hyperlink collection layer, a real-time queue, a page download and parsing layer, and a web page data storage layer.
The seed website configuration node sets and stores the parameters, extraction rules, and other settings of each data source, and is a single node. The seed website configuration node uses a relational database.
The hyperlink collection layer requests the link-list pages of a data source and extracts the hyperlinks of target pages. It consists of one or more web crawler nodes that are physically isolated from one another and cooperate logically to extract the hyperlinks of target pages, and it can scale horizontally with the size of the data source. Each node in this distributed, multi-node layer runs on a timer, combines each collected hyperlink with its corresponding extraction rules, and stores the result in the real-time queue.
The real-time queue stores and serves the URLs extracted by the hyperlink collection layer, the extraction rules corresponding to each URL, and the URLs that have already been visited. It is the core component through which the crawler nodes cooperate. This layer has very high real-time performance: it stores and retrieves data in real time, supports persistent storage, and also filters the collected URLs. The real-time queue is deployed independently; unvisited hyperlinks are stored and retrieved in real time, while visited hyperlinks are stored persistently.
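The behaviour described for the real-time queue (first-in first-out hand-off of unvisited URLs, plus filtering of URLs that were already accepted) can be illustrated with a minimal sketch. The class name `RealTimeQueue` and its methods are hypothetical and not taken from the patent, and the in-memory set only stands in for the persistent store of visited hyperlinks that the text calls for.

```python
from collections import deque


class RealTimeQueue:
    """In-memory stand-in for the real-time queue (hypothetical, single process)."""

    def __init__(self):
        self._pending = deque()  # unvisited (url, rules) pairs, served first-in first-out
        self._known = set()      # every URL ever accepted: still pending or already visited

    def push(self, url, rules):
        """Accept an unvisited URL with its rules; return False if it is filtered out."""
        if url in self._known:
            return False
        self._known.add(url)
        self._pending.append((url, rules))
        return True

    def pop(self):
        """Hand the oldest unvisited URL to a download node, or None if the queue is empty."""
        if not self._pending:
            return None
        return self._pending.popleft()
```

A real deployment would presumably replace both containers with an independently deployed queue service and a durable store, so that several collection and download nodes can share them.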
The page download and parsing layer requests and parses the unvisited URLs in the real-time queue and extracts the specified data in a formatted way. Like the hyperlink collection layer, it consists of multiple web crawler nodes that work independently of one another; it is mainly responsible for structured information extraction of the target data and can scale horizontally with the number of hyperlinks extracted by the hyperlink collection layer. Each node includes one structured information extraction method based on the HTML document tree and one general text extraction method, and the two can be switched when extracting the main text of a web page. Each node in this distributed, multi-node layer runs in real time: after reading a hyperlink from the real-time queue, it filters and parses the page, extracts the target data in a formatted way, and stores it.
The web page data storage layer stores the target data extracted by the page download and parsing layer. It is implemented with an open-source big data store deployed on multiple nodes; it has strong capacity for storing web document data, a large storage volume, and excellent read performance, and the number of nodes can be expanded dynamically.
A distributed network data rapid acquisition method, based on the above distributed network data rapid acquisition system, is shown in Fig. 2 and comprises the following steps:
Step 1: the seed website configuration node sets information such as all seed URLs, extraction rules, and website encodings;
Through a web system, the user adds all target websites to be collected. Each target website contains multiple target columns (site sections) of interest to the user, such as education or people's livelihood. The settings are then made as follows:
First, the website information is set, including the domain name, name, type, and page encoding of the website, its general target-URL filtering regular expression, and its general information extraction rules (for author, publication time, body text, and so on). The information set for the website applies to all collection columns under that website, as shown in Table 1:
Table 1 Website settings
Second, the column information of the website is set, including the column name and seed URL. If the column's domain name, target-URL filtering regular expression, or information extraction rules differ from the website's general settings, they are set individually in the corresponding fields. For example, the website's domain name may be http://newssc.org/ while the column's domain name is http://edu.newssc.org/. Settings that are identical to the website's general settings need not be repeated for the column. Table 2 shows the columns set under "Website 1" of Table 1:
Table 2 Column settings for "Website 1" in Table 1
Step 2: the nodes in the hyperlink collection layer periodically read the data source information, collect the URLs on the specified list pages of each data source, format them, and store them in the real-time queue together with the corresponding structured extraction rules;
The nodes in the hyperlink collection layer periodically read the website and column information set in step 1 and format it with the column as the basic unit. For each collection column, the process is as follows:
First, inherit all the information of the website to which the current column belongs;
Second, if the column has its own domain name, URL filtering regular expression, or page information extraction rules (for author, publication time, body text, and so on), the column's individual settings replace the corresponding settings inherited from the website, as shown in Table 3:
Table 3 Column formatting
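The inherit-then-override rule of the first and second steps can be sketched as a dictionary merge. The function name, the field names, and the seed URL below are illustrative assumptions; only the two domain names, the GBK encoding, and the "(general) rule" wording come from the text.

```python
def effective_settings(site, column):
    """Start from the website's settings, then let the column's own values override them."""
    merged = dict(site)
    for key, value in column.items():
        if value is not None:  # None means "not set for this column, so inherit"
            merged[key] = value
    return merged


# Hypothetical website-level settings (general rules apply to every column).
site = {
    "domain": "http://newssc.org/",
    "encoding": "GBK",
    "url_filter": r"\.html$",
    "body_rule": "general rule 4",
}
# Hypothetical column-level settings: own domain, no body rule of its own.
column = {
    "name": "column 1",
    "seed_url": "http://edu.newssc.org/",
    "domain": "http://edu.newssc.org/",
    "body_rule": None,
}
settings = effective_settings(site, column)
```

Here `settings` keeps the column's own domain while the encoding, URL filter, and body rule fall back to the website's general values.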
Third, request the column's seed URL page and use the current column's URL filtering regular expression to extract the set of target page URLs (one URL set is collected per column);
Fourth, take each element of the current URL set as a target URL and use the domain name set for the current column to complete it (a target URL may appear in the page as '/abc.html' or './abc.html' and must be joined into the complete form 'http://cul.china.com.cn/abc.html'), then combine it with the column information set in the second step to form a new tuple. For example:
The URL set collected from the seed URL (S1) of column 1 is:
SET_1: (link_1, link_2, link_3, ..., link_m)
The tuples stored in the queue are then:
<Link_1, column 1, website 1, GBK, 1, rule 1, rule 2, rule 3, (general) rule 4>
<Link_2, column 1, website 1, GBK, 1, rule 1, rule 2, rule 3, (general) rule 4>
...
<Link_m, column 1, website 1, GBK, 1, rule 1, rule 2, rule 3, (general) rule 4>;
The URL set collected from the seed URL (S2) of column 2 is:
SET_2: (link_1, link_2, link_3, ..., link_n)
The tuples stored in the queue are then:
<Link_1, column 1, website 1, GBK, 1, (general) rule 1, (general) rule 2, (general) rule 3, rule 4>
...
<Link_n, column 1, website 1, GBK, 1, (general) rule 1, (general) rule 2, (general) rule 3, rule 4>.
The information in each tuple includes the target URL, website name, column name, page encoding, title extraction rule, author extraction rule, body-text extraction rule, publication-time extraction rule, website type, and so on. Finally, the assembled tuples are pushed into the real-time queue. This process binds each target URL to its extraction rules in a one-to-one correspondence.
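The third and fourth steps can be sketched as follows: pull the links out of a list page, complete relative links against the column's domain name, keep those matching the column's URL filter, and attach the column settings to each one. The `href` pattern, the dictionary layout, and the sample page are assumptions made for illustration; the patent does not specify how links are parsed or how the tuple is encoded.

```python
import re
from urllib.parse import urljoin

# Naive link pattern, assumed for this sketch; a real crawler would use an HTML parser.
HREF = re.compile(r'href=["\']([^"\']+)["\']')


def build_tuples(list_page_html, settings):
    """Return one dict per target URL, each carrying the column's settings."""
    tuples, seen = [], set()
    for href in HREF.findall(list_page_html):
        url = urljoin(settings["domain"], href)  # completes '/abc.html' and './abc.html'
        if url in seen or not re.search(settings["url_filter"], url):
            continue
        seen.add(url)
        tuples.append({"url": url, **settings})
    return tuples


# Hypothetical list page and column settings.
page = '<a href="/abc.html">a</a> <a href="./abc.html">b</a> <a href="/about">c</a>'
column_settings = {
    "domain": "http://cul.china.com.cn/",
    "url_filter": r"/abc\.html$",
    "encoding": "GBK",
}
result = build_tuples(page, column_settings)
```

Both relative forms resolve to the same absolute URL, so the sketch emits a single tuple for them; each emitted dict would then be pushed into the real-time queue.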
Step 3: the nodes in the page download and parsing layer read the page hyperlinks in the real-time queue in real time, request and download the pages, and extract data in a structured way using the extraction rules passed along with each hyperlink. If the body text is extracted successfully, it is stored directly in the web page data storage layer; otherwise the body text is extracted with the general text extraction method and then stored in the web page data storage layer.
The nodes in the page download and parsing layer read, in real time, the tuples assembled in step 2 from the real-time queue and request the target pages. For each successfully requested target page, the page is parsed with the corresponding extraction rules in the tuple. For information other than the body text, if a rule is empty or extraction with the current rule fails, a rule-empty exception or an extraction-failure exception is reported and logged. For the body text, if the configured body-text extraction rule is empty or extraction fails, the general text extraction method is used (a fallback extraction means that requires no configured rule); if that also fails, a general-extraction-failure exception is reported and logged. If extraction succeeds, the corresponding data is stored in the data storage layer as a single record.
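The rule-first, fallback-second flow of step 3 can be sketched as below. Several things here are assumptions rather than the patent's method: extraction rules are modelled as regular expressions with one capture group, the field names are invented, and `strip_tags` is a deliberately naive placeholder for the general extractor, not the density-based method the patent relies on. Whether a record is still stored when a non-body field fails is not stated in the text; this sketch keeps the record and leaves the field empty.

```python
import logging
import re

log = logging.getLogger("crawler")


def apply_rule(rule, html):
    """Apply one regex rule (assumed to contain a capture group); None on no match."""
    if not rule:
        return None
    match = re.search(rule, html, re.S)
    return match.group(1).strip() if match else None


def extract_record(url, html, rules, generic_body):
    """Extract fields by rule; fall back to generic_body for the body text."""
    record = {"url": url}
    for field in ("title", "author", "published"):
        rule = rules.get(field)
        value = apply_rule(rule, html)
        if not rule:
            log.warning("rule empty: %s for %s", field, url)
        elif value is None:
            log.warning("extraction failed: %s for %s", field, url)
        record[field] = value
    body = apply_rule(rules.get("body"), html)
    if body is None:
        body = generic_body(html)  # fallback that needs no configured rule
    if not body:
        log.error("general extraction failed for %s", url)
        return None
    record["body"] = body
    return record


def strip_tags(html):
    """Naive stand-in for a general text extractor: drop tags, collapse whitespace."""
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())
```

A caller would pass the tuple read from the queue, for example `extract_record(t["url"], html, t["rules"], strip_tags)` with a hypothetical `rules` field, and write the returned dict to the storage layer as one record when it is not `None`.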
Note: the general text extraction method used in the present invention (a density-based text extraction method) is described in a published paper; see: [1] Zhu Zede, Li Miao, Zhang Jian, Chen Lei, Zeng Xinhua. Web text extraction based on a text density model [J]. Pattern Recognition and Artificial Intelligence, 2013, 07: 667-672.
The above description shows and describes some preferred embodiments of the invention. As stated above, however, it should be understood that the invention is not limited to the forms disclosed herein, which should not be regarded as excluding other embodiments; the invention can be used in various other combinations, modifications, and environments, and can be modified within the scope of the inventive concept described herein through the above teachings or the skill or knowledge of the related art. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the protection scope of the appended claims.
Claims (10)
1. A distributed internet data rapid acquisition system, characterized in that it comprises five layers: a seed website configuration node, a hyperlink collection layer, a real-time queue, a page download and parsing layer, and a web page data storage layer;
the seed website configuration node sets and stores the parameters, extraction rules, and other settings of each data source, and is a single node;
the hyperlink collection layer requests the link-list pages of a data source and extracts the hyperlinks of target pages;
the real-time queue stores and serves the URLs extracted by the hyperlink collection layer, the extraction rules corresponding to each URL, and the URLs that have already been visited;
the page download and parsing layer requests and parses the unvisited URLs in the real-time queue and extracts the specified data in a formatted way;
the web page data storage layer stores the target data extracted by the page download and parsing layer.
2. The distributed internet data rapid acquisition system according to claim 1, characterized in that the seed website configuration node uses a relational database.
3. The distributed internet data rapid acquisition system according to claim 1, characterized in that the hyperlink collection layer consists of one or more web crawler nodes that are physically isolated from one another and cooperate logically to extract the hyperlinks of target pages, and can scale horizontally with the size of the data source; the hyperlink collection layer is distributed and multi-node, each node runs on a timer, and the nodes of this layer combine each collected hyperlink with its corresponding extraction rules and store the result in the real-time queue.
4. The distributed internet data rapid acquisition system according to claim 1, characterized in that the real-time queue is deployed independently, unvisited hyperlinks are stored and retrieved in real time, and visited hyperlinks are stored persistently.
5. The distributed internet data rapid acquisition system according to claim 1, characterized in that the page download and parsing layer consists of multiple web crawler nodes that work independently of one another, is mainly responsible for structured information extraction of the target data, and can scale horizontally with the number of hyperlinks extracted by the hyperlink collection layer; each node includes one structured information extraction method based on the HTML document tree and one general text extraction method, and the two can be switched when extracting the main text of a web page;
the page download and parsing layer is distributed and multi-node and can scale horizontally; each node runs in real time and, after reading a hyperlink from the real-time queue, filters and parses the page, extracts the target data in a formatted way, and stores it.
6. The distributed internet data rapid acquisition system according to claim 1, characterized in that the web page data storage layer uses an open-source big data store whose nodes can be expanded.
7. A distributed network data rapid acquisition method, based on the distributed network data rapid acquisition system according to any one of claims 1 to 6, characterized in that it comprises the following steps:
Step 1: the seed website configuration node sets information such as all seed URLs, extraction rules, and website encodings;
Step 2: the nodes in the hyperlink collection layer periodically read the data source information, collect the URLs on the specified list pages of each data source, format them, and store them in the real-time queue together with the corresponding structured extraction rules;
Step 3: the nodes in the page download and parsing layer read the page hyperlinks in the real-time queue in real time, request and download the pages, and extract data in a structured way using the extraction rules passed along with each hyperlink; if the body text is extracted successfully, it is stored directly in the web page data storage layer; otherwise the body text is extracted with the general text extraction method and then stored in the web page data storage layer.
8. The distributed network data rapid acquisition method according to claim 7, characterized in that step 1 is specifically:
through a web system, the user adds all target websites to be collected, each target website containing multiple target columns of interest to the user, and the settings are then made as follows:
first, the website information is set, including the domain name, name, type, and page encoding of the website, its general target-URL filtering regular expression, and its general information extraction rules; the information set for the website applies to all collection columns under that website;
second, the column information of the website is set, including the column name and seed URL; if the column's domain name, target-URL filtering regular expression, or information extraction rules differ from the website's general settings, they are set individually in the corresponding fields.
9. The distributed network data rapid acquisition method according to claim 8, characterized in that step 2 is specifically:
the nodes in the hyperlink collection layer periodically read the website and column information set in step 1 and format it with the column as the basic unit; for each collection column, the process is as follows:
first, inherit all the information of the website to which the current column belongs;
second, if the column has its own domain name, URL filtering regular expression, or page information extraction rules, the column's individual settings replace the corresponding settings inherited from the website;
third, request the column's seed URL page and use the current column's URL filtering regular expression to extract the set of target page URLs, one URL set being collected per column;
fourth, take each element of the current URL set as a target URL, use the domain name set for the current column to complete it, and combine it with the column information set in the second step to form a new tuple; the information in each tuple includes the target URL, website name, column name, page encoding, title extraction rule, author extraction rule, body-text extraction rule, publication-time extraction rule, website type, and so on; finally, the assembled tuples are pushed into the real-time queue, and this process binds each target URL to its extraction rules in a one-to-one correspondence.
10. The distributed network data rapid acquisition method according to claim 9, characterized in that step 3 is specifically: the nodes in the page download and parsing layer read, in real time, the tuples assembled in step 2 from the real-time queue and request the target pages; for each successfully requested target page, the page is parsed with the corresponding extraction rules in the tuple; for information other than the body text, if a rule is empty or extraction with the current rule fails, a rule-empty exception or an extraction-failure exception is reported and logged; for the body text, if the configured body-text extraction rule is empty or extraction fails, the general text extraction method is used, and if that also fails, a general-extraction-failure exception is reported and logged; if extraction succeeds, the corresponding data is stored in the data storage layer as a single record.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610864062.6A CN106484828B (en) | 2016-09-29 | 2016-09-29 | Distributed internet data rapid acquisition system and acquisition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610864062.6A CN106484828B (en) | 2016-09-29 | 2016-09-29 | Distributed internet data rapid acquisition system and acquisition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106484828A (en) | 2017-03-08 |
CN106484828B (en) | 2020-01-21 |
Family
ID=58268931
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610864062.6A Active CN106484828B (en) | 2016-09-29 | 2016-09-29 | Distributed internet data rapid acquisition system and acquisition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106484828B (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108259459A (en) * | 2017-11-16 | 2018-07-06 | 南方电网科学研究院有限责任公司 | A kind of internet data acquires grasping system |
CN108268433A (en) * | 2018-02-26 | 2018-07-10 | 杭州数梦工场科技有限公司 | Title abstracting method and device based on webpage article |
CN108573155A (en) * | 2018-04-18 | 2018-09-25 | 北京知道创宇信息技术有限公司 | Detect method, apparatus, electronic equipment and the storage medium of loophole coverage |
CN109446441A (en) * | 2018-09-26 | 2019-03-08 | 北京邮电大学 | A kind of credible distributed capture storage system of general Web Community |
CN109815382A (en) * | 2018-12-29 | 2019-05-28 | 中国科学院计算技术研究所 | The perception and acquisition methods and system of large scale network data |
CN109840298A (en) * | 2018-12-29 | 2019-06-04 | 中国科学院计算技术研究所 | The multi information source acquisition method and system of large scale network data |
CN109947751A (en) * | 2018-12-29 | 2019-06-28 | 医渡云(北京)技术有限公司 | A kind of medical data processing method, device, readable medium and electronic equipment |
CN110262904A (en) * | 2019-05-17 | 2019-09-20 | 北京达佳互联信息技术有限公司 | Collecting method and device |
CN111078975A (en) * | 2019-12-23 | 2020-04-28 | 北京天元创新科技有限公司 | Multi-node incremental data acquisition system and acquisition method |
CN111258969A (en) * | 2018-11-30 | 2020-06-09 | 中国移动通信集团浙江有限公司 | Internet access log analysis method and device |
CN111680203A (en) * | 2020-05-07 | 2020-09-18 | 支付宝(杭州)信息技术有限公司 | Data acquisition method and device and electronic equipment |
CN111708931A (en) * | 2020-06-06 | 2020-09-25 | 谢国柱 | Big data acquisition method based on mobile internet and artificial intelligence cloud service platform |
CN112100495A (en) * | 2020-09-14 | 2020-12-18 | 山东亿云信息技术有限公司 | Distributed one-stop acquisition method and acquisition system |
CN112287254A (en) * | 2020-11-23 | 2021-01-29 | 武汉虹旭信息技术有限责任公司 | Webpage structured information extraction method and device, electronic equipment and storage medium |
CN116070052A (en) * | 2023-01-28 | 2023-05-05 | 爱集微咨询(厦门)有限公司 | Interface data transmission method, device, terminal and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080104113A1 (en) * | 2006-10-26 | 2008-05-01 | Microsoft Corporation | Uniform resource locator scoring for targeted web crawling |
US7502773B1 (en) * | 2003-12-31 | 2009-03-10 | Microsoft Corporation | System and method facilitating page indexing employing reference information |
CN102646129A (en) * | 2012-03-09 | 2012-08-22 | 武汉大学 | Topic-relative distributed web crawler system |
CN105515815A (en) * | 2014-10-17 | 2016-04-20 | 任子行网络技术股份有限公司 | Heritrix-based distributed collection method and system |
CN105956068A (en) * | 2016-04-27 | 2016-09-21 | 湖南蚁坊软件有限公司 | Webpage URL repetition elimination method based on distributed database |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7502773B1 (en) * | 2003-12-31 | 2009-03-10 | Microsoft Corporation | System and method facilitating page indexing employing reference information |
US20080104113A1 (en) * | 2006-10-26 | 2008-05-01 | Microsoft Corporation | Uniform resource locator scoring for targeted web crawling |
CN102646129A (en) * | 2012-03-09 | 2012-08-22 | 武汉大学 | Topic-relative distributed web crawler system |
CN105515815A (en) * | 2014-10-17 | 2016-04-20 | 任子行网络技术股份有限公司 | Heritrix-based distributed collection method and system |
CN105956068A (en) * | 2016-04-27 | 2016-09-21 | 湖南蚁坊软件有限公司 | Webpage URL repetition elimination method based on distributed database |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108259459A (en) * | 2017-11-16 | 2018-07-06 | 南方电网科学研究院有限责任公司 | A kind of internet data acquires grasping system |
CN108268433A (en) * | 2018-02-26 | 2018-07-10 | 杭州数梦工场科技有限公司 | Title abstracting method and device based on webpage article |
CN108268433B (en) * | 2018-02-26 | 2019-06-11 | 杭州数梦工场科技有限公司 | Title abstracting method and device based on webpage article |
CN108573155B (en) * | 2018-04-18 | 2020-10-16 | 北京知道创宇信息技术股份有限公司 | Method and device for detecting vulnerability influence range, electronic equipment and storage medium |
CN108573155A (en) * | 2018-04-18 | 2018-09-25 | 北京知道创宇信息技术有限公司 | Detect method, apparatus, electronic equipment and the storage medium of loophole coverage |
CN109446441A (en) * | 2018-09-26 | 2019-03-08 | 北京邮电大学 | A kind of credible distributed capture storage system of general Web Community |
CN109446441B (en) * | 2018-09-26 | 2020-11-03 | 北京邮电大学 | General credible distributed acquisition and storage system for network community |
CN111258969B (en) * | 2018-11-30 | 2023-08-15 | 中国移动通信集团浙江有限公司 | Internet access log analysis method and device |
CN111258969A (en) * | 2018-11-30 | 2020-06-09 | 中国移动通信集团浙江有限公司 | Internet access log analysis method and device |
CN109815382A (en) * | 2018-12-29 | 2019-05-28 | 中国科学院计算技术研究所 | The perception and acquisition methods and system of large scale network data |
CN109840298A (en) * | 2018-12-29 | 2019-06-04 | 中国科学院计算技术研究所 | The multi information source acquisition method and system of large scale network data |
CN109947751A (en) * | 2018-12-29 | 2019-06-28 | 医渡云(北京)技术有限公司 | A kind of medical data processing method, device, readable medium and electronic equipment |
CN110262904B (en) * | 2019-05-17 | 2022-10-14 | 北京达佳互联信息技术有限公司 | Data acquisition method and device |
CN110262904A (en) * | 2019-05-17 | 2019-09-20 | 北京达佳互联信息技术有限公司 | Collecting method and device |
CN111078975A (en) * | 2019-12-23 | 2020-04-28 | 北京天元创新科技有限公司 | Multi-node incremental data acquisition system and acquisition method |
CN111078975B (en) * | 2019-12-23 | 2023-04-28 | 北京天元创新科技有限公司 | Multi-node incremental data acquisition system and acquisition method |
CN111680203A (en) * | 2020-05-07 | 2020-09-18 | 支付宝(杭州)信息技术有限公司 | Data acquisition method and device and electronic equipment |
CN111680203B (en) * | 2020-05-07 | 2023-04-18 | 支付宝(杭州)信息技术有限公司 | Data acquisition method and device and electronic equipment |
CN111708931A (en) * | 2020-06-06 | 2020-09-25 | 谢国柱 | Big data acquisition method based on mobile internet and artificial intelligence cloud service platform |
CN111708931B (en) * | 2020-06-06 | 2020-12-25 | 湖南伟业动物营养集团股份有限公司 | Big data acquisition method based on mobile internet and artificial intelligence cloud service platform |
CN112100495A (en) * | 2020-09-14 | 2020-12-18 | 山东亿云信息技术有限公司 | Distributed one-stop acquisition method and acquisition system |
CN112100495B (en) * | 2020-09-14 | 2024-04-16 | 山东亿云信息技术有限公司 | Distributed-based one-stop acquisition method and acquisition system |
CN112287254A (en) * | 2020-11-23 | 2021-01-29 | 武汉虹旭信息技术有限责任公司 | Webpage structured information extraction method and device, electronic equipment and storage medium |
CN112287254B (en) * | 2020-11-23 | 2023-10-27 | 武汉虹旭信息技术有限责任公司 | Webpage structured information extraction method and device, electronic equipment and storage medium |
CN116070052A (en) * | 2023-01-28 | 2023-05-05 | 爱集微咨询(厦门)有限公司 | Interface data transmission method, device, terminal and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN106484828B (en) | 2020-01-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106484828A (en) | A kind of distributed interconnection data Fast Acquisition System and acquisition method | |
KR101527259B1 (en) | Providing posts to discussion threads in response to a search query | |
US9940391B2 (en) | System, method and computer readable medium for web crawling | |
CN103399861B (en) | Method, device and system for recommending web addresses in website navigation | |
CN102663062A (en) | Method and device for processing invalid links in search result | |
CN104182482B (en) | Method for identifying news list pages and method for screening news list pages | |
CN103279567A (en) | Web data collection method and system based on AJAX (Asynchronous JavaScript and XML) | |
KR102222287B1 (en) | Web Crawler System for Collecting a Structured and Unstructured Data in Hidden URL | |
CN103744856A (en) | Method, device and system for linkage extended search | |
CN102222098A (en) | Method and system for pre-fetching webpage | |
CN105302876A (en) | Regular expression based URL filtering method | |
CN102880679B (en) | Web page information storage method and device | |
CN103123640A (en) | Method and device for searching novel | |
CN103605742B (en) | Method and device for recognizing catalogue pages of Internet resource entities | |
Thelwall | A publicly accessible database of UK university website links and a discussion of the need for human intervention in web crawling | |
JP4253315B2 (en) | Knowledge information collecting system and knowledge information collecting method | |
CN108763583A (en) | Microblog hot topic extraction method and system based on keyword search | |
Matsunaga et al. | A web syllabus crawler and its efficiency evaluation | |
Liu et al. | User Browsing Graph: Structure, Evolution and Application. | |
Huang et al. | TREC 2018 News Track. | |
JP4831728B2 (en) | Marketing system using web bookmarks | |
JP6510452B2 (en) | Search server, search system, search information distribution system, search program, search information distribution program | |
Madaan et al. | A novel architecture for a blog crawler | |
JP3725087B2 (en) | Knowledge information collecting system and knowledge information collecting method | |
CN102103622A (en) | Web site rebuilding system and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | | Effective date of registration: 20230810; Address after: 621000 1st floor, building 4, innovation center, science and innovation District, Mianyang City, Sichuan Province; Patentee after: Sichuan Zhongke rongchuang Technology Co.,Ltd.; Address before: 621010 No. 59, Qinglong Avenue, Mianyang City, Sichuan Province; Patentee before: Southwest University of Science and Technology |