CN105955984A - Network data searching method based on crawler mode - Google Patents

Network data searching method based on crawler mode

Info

Publication number
CN105955984A
Authority
CN
China
Prior art keywords
url
configuration file
target page
crawler
network data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610242327.9A
Other languages
Chinese (zh)
Inventor
郑建宾
李忠洵
华锦芝
高峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unionpay Co Ltd
Original Assignee
China Unionpay Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unionpay Co Ltd filed Critical China Unionpay Co Ltd
Priority to CN201610242327.9A priority Critical patent/CN105955984A/en
Publication of CN105955984A publication Critical patent/CN105955984A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention proposes a network data searching method based on a crawler mode. The method includes: performing a configuration operation to set configuration files; performing an initialization operation based on the configuration files to generate a regular-expression-based directory page and target page filter and a target page path template; and performing a formal crawling operation based on the directory page and target page filter and the target page template, so as to acquire data from confirmed target web pages and store the data. The disclosed method has high versatility and searching accuracy and is convenient to use.

Description

Network data searching method based on a crawler mode
Technical field
The present invention relates to data searching methods and, more particularly, to a network data searching method based on a crawler mode.
Background art
At present, as computer networks are applied ever more widely and the types of services in different fields grow ever richer, searching the massive data contained in the web pages of websites to obtain target data of interest is becoming more and more important.
In existing technical solutions, data is generally collected from the web pages of a website by means of a topic-oriented focused crawler.
However, the existing technical solutions have the following problem: because a single crawler program can only process page data of one particular structure on one particular website, versatility is poor; that is, different crawler programs must be designed and written for different requirements, which raises cost and makes maintenance difficult.
Accordingly, there is a need for a network data searching method based on a crawler mode that has higher versatility and searching accuracy and is easy to use.
Summary of the invention
In order to solve the problems of the prior art described above, the present invention proposes a network data searching method based on a crawler mode that has higher versatility and searching accuracy and is easy to use.
The object of the present invention is achieved through the following technical solution:
A network data searching method based on a crawler mode, the method comprising the following steps:
(A1) performing a configuration operation to set configuration files;
(A2) performing an initialization operation based on the configuration files, to generate a regular-expression-based directory page and target page filter and a target page path template;
(A3) performing a formal crawling operation based on the directory page and target page filter and the target page template, to acquire data from confirmed target web pages and store the data.
Preferably, in the solution disclosed above, step (A1) further comprises: setting a first configuration file, the first configuration file being used to store a seed URL, the seed URL being the entry link that triggers the crawling operation.
Preferably, in the solution disclosed above, step (A1) further comprises: setting a second configuration file, the second configuration file being used to store one or more keywords associated with the target pages so as to assist the operation of confirming target pages, wherein the second configuration file is further used to define the logical relationship between the keywords.
Preferably, in the solution disclosed above, step (A1) further comprises: setting a third configuration file, the third configuration file being used to store the URLs of at least two target pages that have been confirmed by the user.
Preferably, in the solution disclosed above, step (A1) further comprises: setting a fourth configuration file, the fourth configuration file being used to store basic configuration information and other auxiliary configuration information.
Preferably, in the solution disclosed above, step (A1) further comprises: setting a fifth configuration file, the fifth configuration file being used to store one or more tags to be crawled, wherein the one or more tags comprise the tags, and their content, of the web page indicated by the first of the URLs set in the third configuration file.
Preferably, in the solution disclosed above, step (A1) further comprises: when the tag content of one or more of the tags to be crawled consists of multiple data items whose number is not fixed, storing any two of those data items in the corresponding tag content.
Preferably, in the solution disclosed above, step (A1) further comprises: when the tag content of one or more of the tags to be crawled is a link to an associated sub-tag page, adding a "subtab" mark at the corresponding position in the fifth configuration file, wherein the sub-tag indicated by the "subtab" mark should store the link to the sub-tag page and the sub-tag data to be crawled in that sub-tag page, separated by a specific symbol.
Preferably, in the solution disclosed above, step (A2) further comprises: reading from the third configuration file the URLs of all target pages stored therein, and then generating a target page regular expression according to the common features and structure of all the URLs read and a predetermined URL regular expression generation rule, wherein the target page regular expression and a directory page regular expression constitute the directory page and target page filter.
Preferably, in the solution disclosed above, step (A2) further comprises: starting from the seed URL set in the first configuration file, crawling N pages in a manner combining breadth-first and best-first, and then clustering the directory pages among them according to a clustering rule to generate the directory page regular expression, wherein a directory page is defined as a page in which the number of URLs pointing to target pages is greater than N.
Preferably, in the solution disclosed above, the clustering rule comprises: grouping URLs from the URL list that have the same host part into the same subset, each subset constituting one URL cluster; then inputting each URL cluster into an aggregator to aggregate the URL cluster into multiple URL classes, each URL class containing several relatively similar URLs; and, using the URL regular expression generation rule, generating one URL regular expression for each URL class produced by aggregating each URL cluster.
Preferably, in the solution disclosed above, step (A2) further comprises generating the target page path template as follows: reading the URLs of the first two target pages from the third configuration file and downloading the HTML files they indicate; then generating tree structures based on the elements in the two HTML files; and then extracting the nodes that are identical in the two trees to form a new tree, the newly generated tree constituting the target page path template for this page type.
Preferably, in the solution disclosed above, step (A2) further comprises determining the exact path of each required tag as follows: after the target page path template is determined, reading the tags to be crawled one by one from the fifth configuration file and, using the tag content of each tag as the basis, first traversing the leaf nodes of the stored tree; if a leaf node is found whose text content or attribute value equals or contains the tag content, storing that leaf node and its path in the corresponding queue; if no leaf node holds the required tag, traversing the non-leaf nodes bottom-up and performing the same action. Each time the traversal backtracks by one node, if the node has no same-named node among its sibling nodes on the same level, its node name is stored directly in the node path; if the same level does contain same-named nodes, it is first determined whether the node has an attribute value that distinguishes it from the other same-named nodes; if so, the attribute name and attribute value are stored in the path together with the node name; if the same-named nodes have no such attribute, the relative position of the node among the same-named nodes on that level is stored, so that it can be distinguished from the other nodes and located precisely.
Preferably, in the solution disclosed above, step (A3) further comprises: performing the crawling operation from the entry link stored in the first configuration file in a manner combining breadth-first and best-first, and using the directory page and target page filter to filter out unwanted URLs and retain the URLs of the directory pages and target pages to be analyzed, wherein the retained URLs satisfy the target page regular expression and/or the directory page regular expression, and the pages indicated by those URLs each contain at least one of the one or more keywords; and analyzing, based on the exact paths of the required tags, the pages indicated by the retained target page URLs to obtain the target tag data of the required tags.
Preferably, in the solution disclosed above, step (A3) further comprises: storing the target tag data and the exception data associated with the crawling operation.
The network data searching method based on a crawler mode disclosed in the present invention has the following advantages: higher versatility and searching accuracy, and ease of use.
Brief description of the drawings
The technical features and advantages of the present invention will be better understood by those skilled in the art with reference to the accompanying drawing, in which:
Fig. 1 is a flowchart of a network data searching method based on a crawler mode according to an embodiment of the present invention.
Detailed description of the embodiments
Fig. 1 is a flowchart of a network data searching method based on a crawler mode according to an embodiment of the present invention. As shown in Fig. 1, the network data searching method based on a crawler mode disclosed in the present invention comprises the following steps: (A1) performing a configuration operation to set configuration files; (A2) performing an initialization operation based on the configuration files, to generate a regular-expression-based directory page and target page filter and a target page path template; (A3) performing a formal crawling operation based on the directory page and target page filter and the target page template, to acquire data from confirmed target web pages and store the data.
Preferably, in the network data searching method based on a crawler mode disclosed in the present invention, step (A1) further comprises: setting a first configuration file, the first configuration file being used to store a seed URL, the seed URL being the entry link that triggers the crawling operation (exemplarily, the entry link is a start page related to the target pages or directory pages).
Preferably, in the network data searching method based on a crawler mode disclosed in the present invention, step (A1) further comprises: setting a second configuration file, the second configuration file being used to store one or more keywords associated with the target pages so as to assist the operation of confirming target pages (exemplarily, the keywords are one or several keywords, found in the source files of the intended target pages, that are present in nearly all intended target pages and can roughly locate the target page content), wherein the second configuration file is further used to define the logical relationship between the keywords (for example, "AND", "OR", etc.).
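By way of a non-limiting illustration, the keyword check with a configurable logical relationship could be sketched as follows. The function name `page_matches_keywords` and the `logic` parameter are hypothetical; the description only states that relationships such as "AND" and "OR" can be defined between keywords.

```python
from typing import Iterable


def page_matches_keywords(page_text: str, keywords: Iterable[str], logic: str = "or") -> bool:
    """Return True if the page text satisfies the configured keyword rule.

    logic="and": every keyword must appear in the page text.
    logic="or":  at least one keyword must appear in the page text.
    """
    hits = [keyword in page_text for keyword in keywords]
    if logic == "and":
        return all(hits)
    if logic == "or":
        return any(hits)
    raise ValueError(f"unsupported logic: {logic!r}")
```

A plain substring test is used here for brevity; a real crawler might instead match against the page source after normalizing case or encoding.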
Preferably, in the network data searching method based on a crawler mode disclosed in the present invention, step (A1) further comprises: setting a third configuration file, the third configuration file being used to store the URLs of at least two target pages that have been confirmed by the user.
Preferably, in the network data searching method based on a crawler mode disclosed in the present invention, step (A1) further comprises: setting a fourth configuration file, the fourth configuration file being used to store basic configuration information (such as the domain name) and other auxiliary configuration information (such as the maximum number of pages to pre-crawl in the pre-crawling operation of the initialization phase).
Preferably, in the network data searching method based on a crawler mode disclosed in the present invention, step (A1) further comprises: setting a fifth configuration file, the fifth configuration file being used to store one or more tags to be crawled, wherein the one or more tags comprise the tags, and their content, of the web page indicated by the first of the URLs set in the third configuration file (exemplarily, the tag name can be chosen by the user according to the data itself, whereas the tag content needs to be copied from the target page or entered manually, and the stored data form must remain consistent with the data on the page).
Preferably, in the network data searching method based on a crawler mode disclosed in the present invention, step (A1) further comprises: when the tag content of one or more of the tags to be crawled consists of multiple data items whose number is not fixed (such as comments), storing any two of those data items in the corresponding tag content.
Preferably, in the network data searching method based on a crawler mode disclosed in the present invention, step (A1) further comprises: when the tag content of one or more of the tags to be crawled is a link to an associated sub-tag page, adding a "subtab" mark at the corresponding position in the fifth configuration file, wherein the sub-tag indicated by the "subtab" mark should store the link to the sub-tag page and the sub-tag data to be crawled in that sub-tag page, separated by a specific symbol.
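As a rough sketch only, the five configuration files could be held in memory as a single structure like the one below. The description does not specify a file format, so the class name `CrawlerConfig` and all field names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CrawlerConfig:
    """Hypothetical in-memory view of the five configuration files."""

    seed_urls: List[str]           # first file: entry links that trigger crawling
    keywords: List[str]            # second file: keywords tied to target pages
    keyword_logic: str             # second file: relationship, e.g. "and" / "or"
    sample_target_urls: List[str]  # third file: user-confirmed target page URLs
    basic: Dict[str, str] = field(default_factory=dict)       # fourth file
    tags: Dict[str, List[str]] = field(default_factory=dict)  # fifth file: tag name -> sample content

    def __post_init__(self) -> None:
        # The third configuration file must hold at least two confirmed URLs.
        if len(self.sample_target_urls) < 2:
            raise ValueError("at least two confirmed target page URLs are required")
```

Storing a list of sample contents per tag leaves room for the case where a tag holds a variable number of items, for which two samples are saved.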
Preferably, in the network data searching method based on a crawler mode disclosed in the present invention, step (A2) further comprises: reading from the third configuration file the URLs of all target pages stored therein, and then generating a target page regular expression according to the common features and structure of all the URLs read and a predetermined URL regular expression generation rule, wherein the target page regular expression and a directory page regular expression constitute the directory page and target page filter.
Preferably, in the network data searching method based on a crawler mode disclosed in the present invention, step (A2) further comprises: starting from the seed URL set in the first configuration file, crawling N (for example, N = 10000) pages in a manner combining breadth-first and best-first (for example, the best-first mode can be implemented by setting URL weights and judging keywords), and then clustering the directory pages among them according to a clustering rule to generate the directory page regular expression, wherein a directory page is defined as a page in which the number of URLs pointing to target pages is greater than N (for example, N = 5).
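The directory page definition above can be illustrated with a short sketch. The function name `is_directory_page` is hypothetical, and the default threshold of 5 is taken from the example value given in the description.

```python
import re
from typing import Iterable


def is_directory_page(links: Iterable[str], target_regex: str, threshold: int = 5) -> bool:
    """Return True if more than `threshold` outgoing links match the target page regex."""
    pattern = re.compile(target_regex)
    matching = sum(1 for url in links if pattern.fullmatch(url))
    return matching > threshold
```

Note that the comparison is strict: a page with exactly `threshold` matching links is not treated as a directory page.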
Preferably, in the network data searching method based on a crawler mode disclosed in the present invention, step (A2) further comprises: dividing any URL into three parts, host, path and query, wherein the path part consists of a series of directories and the query part consists of a series of key-value pairs (for example, if the URL is http://news.qq.com/a/20150415/044667.htm?tu_biz=1.114.1.0&du=1, then the host part is news.qq.com, the path part is /a/20150415/044667.htm, and the query part is tu_biz=1.114.1.0&du=1). On this basis, the URL regular expression generation rule is as follows:
Parameter form | Generation rule
Host part | Retain the original host string directly
Parameters at the same position within the class are identical (except pure digits) | Retain the original form directly
Parameters at the same position within the class are all letters | [a-zA-Z]+
Parameters at the same position within the class are all pure digits | [0-9]+
Parameters at the same position within the class are all dates of this form, e.g. 2015-04-27 | [0-9]+-[0-9]+-[0-9]+
Parameters at the same position within the class are combinations of letters, digits and "_" | \w+
Parameters at the same position within the class are combinations of letters, digits, "_", "-" and "." | [-\w\.]+
Appended after the path | (.[a-zA-Z]+)?/?
When none of the above forms matches, use a wildcard | .*
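The URL split and the generation rule can be sketched as follows, under stated simplifications: only the host and the path segments are generalized, all URLs in a class are assumed to have the same path depth, and the query part and the trailing-extension rule are left out. The names `split_url`, `generalize` and `url_class_regex` are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Candidate patterns, tried in the order listed by the generation rule.
RULES = [r"[a-zA-Z]+", r"[0-9]+", r"[0-9]+-[0-9]+-[0-9]+", r"\w+", r"[-\w\.]+"]


def split_url(url):
    """Split a URL into its (host, path, query) parts."""
    parts = urlsplit(url)
    return parts.netloc, parts.path, parts.query


def generalize(tokens):
    """Map the tokens found at one position across a URL class to a pattern."""
    first = tokens[0]
    # Identical tokens are kept literally, except pure digits.
    if all(t == first for t in tokens) and not re.fullmatch(r"[0-9]+", first):
        return re.escape(first)
    for rule in RULES:
        if all(re.fullmatch(rule, t, re.ASCII) for t in tokens):
            return rule
    return ".*"  # wildcard fallback when no rule matches


def url_class_regex(urls):
    """Build one regular expression for a class of similar URLs (host and path only)."""
    parts = [urlsplit(u) for u in urls]
    segments = [p.path.strip("/").split("/") for p in parts]
    if len({len(s) for s in segments}) != 1:
        raise ValueError("URLs in one class must have the same path depth")
    path = "/".join(generalize(list(column)) for column in zip(*segments))
    return f"{re.escape(parts[0].scheme)}://{re.escape(parts[0].netloc)}/{path}"
```

For instance, feeding two article URLs from the same host and directory yields a pattern that keeps the host and the shared directory literally and generalizes the date-like and file-name positions.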
Preferably, in the network data searching method based on a crawler mode disclosed in the present invention, the clustering rule comprises: grouping URLs from the URL list that have the same host part into the same subset, each subset constituting one URL cluster; then inputting each URL cluster into an aggregator to aggregate the URL cluster into multiple URL classes, each URL class containing several relatively similar URLs (that is, the distances between these URLs are all no greater than a preset threshold; for example, the distance between URLs can be calculated from the distances, i.e. the degrees of similarity, between the corresponding parts among the host, path and query parts); and, using the URL regular expression generation rule, generating one URL regular expression for each URL class produced by aggregating each URL cluster.
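A minimal sketch of the clustering rule follows. The distance function is an assumption: the description only says that the distance is derived from the similarity of the URL parts, so here it is taken as the share of path positions whose "shape" (the segment with digit runs collapsed) differs. The names `shape`, `url_distance` and `cluster_urls`, and the default threshold, are hypothetical.

```python
import re
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlsplit


def shape(segment):
    """Collapse digit runs so that '123.htm' and '45.htm' share one shape."""
    return re.sub(r"[0-9]+", "0", segment)


def url_distance(a, b):
    """Assumed distance: fraction of path positions whose shapes differ."""
    seg_a = urlsplit(a).path.strip("/").split("/")
    seg_b = urlsplit(b).path.strip("/").split("/")
    pairs = list(zip_longest(seg_a, seg_b))
    differing = sum(1 for x, y in pairs if x is None or y is None or shape(x) != shape(y))
    return differing / len(pairs)


def cluster_urls(urls, threshold=0.3):
    """Group URLs by host, then split each host cluster into classes of similar URLs."""
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).netloc].append(url)
    result = {}
    for host, members in by_host.items():
        classes = []
        for url in members:
            for cls in classes:
                # Join a class only if the URL is close to every member of it.
                if all(url_distance(url, other) <= threshold for other in cls):
                    cls.append(url)
                    break
            else:
                classes.append([url])
        result[host] = classes
    return result
```

Each resulting class could then be passed to the URL regular expression generation rule to obtain one expression per class.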
Preferably, in the network data searching method based on a crawler mode disclosed in the present invention, step (A2) further comprises generating the target page path template as follows: reading the URLs of the first two target pages from the third configuration file and downloading the HTML files they indicate; then generating tree structures based on the elements in the two HTML files; and then extracting the nodes that are identical in the two trees to form a new tree, the newly generated tree constituting the target page path template for this page type.
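The tree intersection can be illustrated as below. This is a sketch under assumptions: both pages are taken to be well-formed markup that `xml.etree.ElementTree` can parse (real HTML would need a tolerant parser), and children are paired by position, which the description does not prescribe. The function name `common_tree` is hypothetical.

```python
import xml.etree.ElementTree as ET


def common_tree(a, b):
    """Return a new tree holding only the nodes shared by both trees, or None."""
    if a.tag != b.tag:
        return None
    # Keep only the attributes that carry the same value in both pages.
    shared = {k: v for k, v in a.attrib.items() if b.attrib.get(k) == v}
    merged = ET.Element(a.tag, shared)
    # Pair children by position; unmatched trailing children are dropped.
    for child_a, child_b in zip(a, b):
        sub = common_tree(child_a, child_b)
        if sub is not None:
            merged.append(sub)
    return merged
```

Text content is deliberately not copied into the template, since it is expected to differ between two target pages of the same type.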
Preferably, in the network data searching method based on a crawler mode disclosed in the present invention, step (A2) further comprises determining the exact path of each required tag (i.e. a generic path template) as follows: after the target page path template is determined, reading the tags to be crawled one by one from the fifth configuration file and, using the tag content of each tag as the basis, first traversing the leaf nodes of the stored tree; if a leaf node is found whose text content or attribute value equals or contains the tag content, storing that leaf node and its path in the corresponding queue; if no leaf node holds the required tag, traversing the non-leaf nodes bottom-up and performing the same action. Each time the traversal backtracks by one node, if the node has no same-named node among its sibling nodes on the same level, its node name is stored directly in the node path; if the same level does contain same-named nodes, it is first determined whether the node has an attribute value that distinguishes it from the other same-named nodes; if so, the attribute name and attribute value are stored in the path together with the node name; if the same-named nodes have no such attribute, the relative position of the node among the same-named nodes on that level is stored, so that it can be distinguished from the other nodes and located precisely.
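The path determination described above can be sketched as follows, assuming the tree is an `xml.etree.ElementTree` element built from well-formed markup and assuming an XPath-like notation for the stored path; neither is specified in the description. The names `step_for` and `find_tag_path` are hypothetical.

```python
import xml.etree.ElementTree as ET


def step_for(node, parent):
    """Path step for one node, disambiguated among same-named siblings."""
    if parent is None:
        return node.tag
    twins = [s for s in parent if s.tag == node.tag]
    if len(twins) == 1:
        return node.tag  # no same-named sibling: the node name is enough
    for name, value in node.attrib.items():
        # Use an attribute whose value no other same-named sibling shares.
        if all(s is node or s.get(name) != value for s in twins):
            return f"{node.tag}[@{name}='{value}']"
    # Otherwise fall back to the 1-based position among same-named siblings.
    return f"{node.tag}[{twins.index(node) + 1}]"


def find_tag_path(root, content):
    """Find the node whose text or attribute value contains `content`; return its path."""
    parents = {child: parent for parent in root.iter() for child in parent}
    leaves = [n for n in root.iter() if len(n) == 0]
    inner = [n for n in root.iter() if len(n) > 0]
    # Leaves first, then non-leaf nodes in reverse document order (descendants before ancestors).
    for node in leaves + inner[::-1]:
        values = [node.text or ""] + list(node.attrib.values())
        if any(content in v for v in values):
            steps = []
            while node is not None:
                steps.append(step_for(node, parents.get(node)))
                node = parents.get(node)
            return "/" + "/".join(reversed(steps))
    return None
```

With this notation, a price held in the second of two `div` elements distinguished by an `id` attribute would be stored with the attribute in its path, whereas one of two attribute-less `p` siblings would be stored with its position.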
Preferably, in the network data searching method based on a crawler mode disclosed in the present invention, step (A3) further comprises: performing the crawling operation from the entry link stored in the first configuration file in a manner combining breadth-first and best-first (for example, the best-first mode can be implemented by setting URL weights and judging keywords), and using the directory page and target page filter to filter out unwanted URLs and retain the URLs of the directory pages and target pages to be analyzed, wherein the retained URLs satisfy the target page regular expression and/or the directory page regular expression, and the pages indicated by those URLs each contain at least one of the one or more keywords; and analyzing, based on the exact paths of the required tags, the pages indicated by the retained target page URLs to obtain the target tag data (i.e. the tag content) of the required tags.
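One possible reading of the formal crawling step is sketched below. The way breadth-first and best-first are combined here (order by depth, then by URL weight within a depth) is an assumption, as are the weights themselves and the `fetch` callback, which stands in for downloading a page and returns its text and outgoing links. The function name `crawl` is hypothetical.

```python
import heapq
import re


def crawl(seed, fetch, target_re, directory_re, keywords, max_pages=100):
    """Return the target page URLs that pass the regex filter and the keyword check.

    fetch(url) must return a (text, links) pair for the page at `url`.
    """
    target = re.compile(target_re)
    directory = re.compile(directory_re)

    def weight(url):
        # Assumed weighting: target pages are preferred over directory pages.
        return 2 if target.fullmatch(url) else 1

    queue = [(0, 0, seed)]  # entries are (depth, -weight, url)
    seen = {seed}
    hits = []
    fetched = 0
    while queue and fetched < max_pages:
        depth, _, url = heapq.heappop(queue)
        text, links = fetch(url)
        fetched += 1
        # Pages other than the seed are kept only if they contain a keyword.
        if url != seed and not any(k in text for k in keywords):
            continue
        if target.fullmatch(url):
            hits.append(url)
        for link in links:
            if link in seen:
                continue
            # Filter out URLs that match neither regular expression.
            if not (target.fullmatch(link) or directory.fullmatch(link)):
                continue
            seen.add(link)
            heapq.heappush(queue, (depth + 1, -weight(link), link))
    return hits
```

In a full pipeline, each URL returned here would then be analyzed with the stored tag paths to extract the tag content.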
Preferably, in the network data searching method based on a crawler mode disclosed in the present invention, step (A3) further comprises: storing the target tag data and the exception data associated with the crawling operation.
Therefore, the network data searching method based on a crawler mode disclosed in the present invention has the following advantages: higher versatility and searching accuracy, and ease of use.
Although the present invention has been described by way of the above preferred embodiments, its implementation is not limited to them. It should be understood that those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope.

Claims (15)

1. A network data searching method based on a crawler mode, the method comprising the following steps:
(A1) performing a configuration operation to set configuration files;
(A2) performing an initialization operation based on the configuration files, to generate a regular-expression-based directory page and target page filter and a target page path template;
(A3) performing a formal crawling operation based on the directory page and target page filter and the target page template, to acquire data from confirmed target web pages and store the data.
2. The network data searching method based on a crawler mode according to claim 1, characterized in that step (A1) further comprises: setting a first configuration file, the first configuration file being used to store a seed URL, the seed URL being the entry link that triggers the crawling operation.
3. The network data searching method based on a crawler mode according to claim 2, characterized in that step (A1) further comprises: setting a second configuration file, the second configuration file being used to store one or more keywords associated with the target pages so as to assist the operation of confirming target pages, wherein the second configuration file is further used to define the logical relationship between the keywords.
4. The network data searching method based on a crawler mode according to claim 3, characterized in that step (A1) further comprises: setting a third configuration file, the third configuration file being used to store the URLs of at least two target pages that have been confirmed by the user.
5. The network data searching method based on a crawler mode according to claim 4, characterized in that step (A1) further comprises: setting a fourth configuration file, the fourth configuration file being used to store basic configuration information and other auxiliary configuration information.
6. The network data searching method based on a crawler mode according to claim 5, characterized in that step (A1) further comprises: setting a fifth configuration file, the fifth configuration file being used to store one or more tags to be crawled, wherein the one or more tags comprise the tags, and their content, of the web page indicated by the first of the URLs set in the third configuration file.
7. The network data searching method based on a crawler mode according to claim 6, characterized in that step (A1) further comprises: when the tag content of one or more of the tags to be crawled consists of multiple data items whose number is not fixed, storing any two of those data items in the corresponding tag content.
8. The network data searching method based on a crawler mode according to claim 7, characterized in that step (A1) further comprises: when the tag content of one or more of the tags to be crawled is a link to an associated sub-tag page, adding a "subtab" mark at the corresponding position in the fifth configuration file, wherein the sub-tag indicated by the "subtab" mark should store the link to the sub-tag page and the sub-tag data to be crawled in that sub-tag page, separated by a specific symbol.
9. The network data searching method based on a crawler mode according to claim 8, characterized in that step (A2) further comprises: reading from the third configuration file the URLs of all target pages stored therein, and then generating a target page regular expression according to the common features and structure of all the URLs read and a predetermined URL regular expression generation rule, wherein the target page regular expression and a directory page regular expression constitute the directory page and target page filter.
10. The network data searching method based on a crawler mode according to claim 9, characterized in that step (A2) further comprises: starting from the seed URL set in the first configuration file, crawling N pages in a manner combining breadth-first and best-first, and then clustering the directory pages among them according to a clustering rule to generate the directory page regular expression, wherein a directory page is defined as a page in which the number of URLs pointing to target pages is greater than N.
11. The network data searching method based on a crawler mode according to claim 10, characterized in that the clustering rule comprises: grouping URLs from the URL list that have the same host part into the same subset, each subset constituting one URL cluster; then inputting each URL cluster into an aggregator to aggregate the URL cluster into multiple URL classes, each URL class containing several relatively similar URLs; and, using the URL regular expression generation rule, generating one URL regular expression for each URL class produced by aggregating each URL cluster.
12. The network data searching method based on a crawler mode according to claim 11, characterized in that step (A2) further comprises generating the target page path template as follows: reading the URLs of the first two target pages from the third configuration file and downloading the HTML files they indicate; then generating tree structures based on the elements in the two HTML files; and then extracting the nodes that are identical in the two trees to form a new tree, the newly generated tree constituting the target page path template for this page type.
13. The network data searching method based on a crawler mode according to claim 12, characterized in that step (A2) further comprises determining the exact path of each required tag as follows: after the target page path template is determined, reading the tags to be crawled one by one from the fifth configuration file and, using the tag content of each tag as the basis, first traversing the leaf nodes of the stored tree; if a leaf node is found whose text content or attribute value equals or contains the tag content, storing that leaf node and its path in the corresponding queue; if no leaf node holds the required tag, traversing the non-leaf nodes bottom-up and performing the same action, wherein, each time the traversal backtracks by one node, if the node has no same-named node among its sibling nodes on the same level, its node name is stored directly in the node path; if the same level does contain same-named nodes, it is first determined whether the node has an attribute value that distinguishes it from the other same-named nodes; if so, the attribute name and attribute value are stored in the path together with the node name; if the same-named nodes have no such attribute, the relative position of the node among the same-named nodes on that level is stored, so that it can be distinguished from the other nodes and located precisely.
14. searching network data methods based on reptile pattern according to claim 13, it is characterized in that, described step (A3) farther includes: the linking inlet ports preserved from described first configuration file carries out crawling operation in the way of most preferably preferentially combining by breadth First, and utilize described catalog page and target pages filter filter out unwanted URL and retain the URL of catalog page to be analyzed and target pages, wherein, the page of the catalog page to be analyzed of described reservation and the URL instruction of target pages meets described target pages regular expression and/or catalog page regular expression, and the page indicated by described URL all comprises at least one in the one or more key word;The target labels data of the page indicated by URL of target pages to be analyzed based on the described required label exact path described reservation of analysis label required for obtaining.
15. The network data searching method based on crawler mode according to claim 14, characterized in that step (A3) further comprises: storing the target label data and the abnormal data associated with the crawling operation.
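The label path rule of claim 13 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. It assumes the saved tree is available as well-formed XML parsed with the standard `xml.etree.ElementTree` module, writes the path in an XPath-like notation chosen here for convenience, and approximates the bottom-up pass by checking non-leaf nodes deepest first. The function names `find_label_node` and `build_path` and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET


def find_label_node(root, label_content):
    """Return the first node whose text or attribute value contains label_content."""
    # Record the depth of every node so non-leaf nodes can be visited deepest first.
    depth = {root: 0}
    for parent in root.iter():
        for child in parent:
            depth[child] = depth[parent] + 1

    def matches(node):
        text = (node.text or "").strip()
        in_text = bool(text) and label_content in text
        in_attr = any(label_content in value for value in node.attrib.values())
        return in_text or in_attr

    nodes = list(root.iter())
    leaves = [n for n in nodes if len(n) == 0]
    inner = sorted((n for n in nodes if len(n) > 0), key=lambda n: -depth[n])
    for node in leaves + inner:  # leaf nodes first, then non-leaf nodes bottom-up
        if matches(node):
            return node
    return None


def build_path(root, node):
    """Backtrack from node to root, saving one disambiguated step per level."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    while node is not root:
        parent = parent_of[node]
        same_name = [s for s in parent if s.tag == node.tag]
        step = node.tag  # no same-named sibling: the node name is enough
        if len(same_name) > 1:
            others = [s for s in same_name if s is not node]
            # Look for an attribute value that no same-named sibling shares.
            unique = next(
                ((k, v) for k, v in node.attrib.items()
                 if all(s.get(k) != v for s in others)),
                None,
            )
            if unique is not None:
                step += "[@{}='{}']".format(*unique)
            else:
                # Fall back to the 1-based position among same-named siblings.
                step += "[{}]".format(same_name.index(node) + 1)
        steps.append(step)
        node = parent
    return "/".join(["."] + steps[::-1])


# Hypothetical sample page standing in for a crawled target page.
PAGE = """
<html><body>
  <div class="nav"><a href="/list">List</a></div>
  <div class="item"><span>Price</span><span>42 USD</span></div>
  <ul><li>first</li><li>second</li></ul>
</body></html>
"""
root = ET.fromstring(PAGE)
node = find_label_node(root, "42 USD")
path = build_path(root, node)
print(path)  # ./body/div[@class='item']/span[2]
assert root.find(path) is node  # the saved path leads back to the same node
```

The two `div` elements are told apart by their `class` values, while the two `span` elements carry no attributes and are therefore told apart by position, which mirrors the two fallback cases in the claim. Attribute values containing a single quote are not handled in this sketch.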
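The filtering and crawl order of claim 14 can be sketched in the same hedged way. The regular expressions, keywords, URLs and the in-memory `PAGES` dictionary below are all hypothetical stand-ins for the configuration files and the live web. The way breadth-first and best-first search are combined here (order by depth, then by the keyword hits of the page on which a link was found) is one possible reading of the claim, not a confirmed detail of the application.

```python
import heapq
import re

# Hypothetical filter settings; in the claims these come from configuration files.
TARGET_RE = re.compile(r"https://example\.com/item/\d+")
CATALOG_RE = re.compile(r"https://example\.com/list(\?page=\d+)?")
KEYWORDS = ("crawler", "search")

# In-memory stand-in for the web: URL -> (page text, outgoing links).
PAGES = {
    "https://example.com/list": (
        "crawler search index",
        [
            "https://example.com/item/1",
            "https://example.com/item/2",
            "https://example.com/about",
            "https://example.com/list?page=2",
        ],
    ),
    "https://example.com/list?page=2": (
        "more crawler results",
        ["https://example.com/item/3"],
    ),
    "https://example.com/item/1": ("a crawler for search data", []),
    "https://example.com/item/2": ("unrelated cooking recipe", []),
    "https://example.com/item/3": ("search tips", []),
    "https://example.com/about": ("crawler company info", []),
}


def keyword_hits(text):
    """Count how many of the configured keywords occur in the page text."""
    return sum(1 for word in KEYWORDS if word in text)


def crawl(entry_url):
    """Return (catalog URLs, target URLs) retained by the filter."""
    # Heap entries are (depth, -parent keyword hits, url): shallower pages come
    # first (breadth-first), and links from more relevant pages win ties (best-first).
    frontier = [(0, 0, entry_url)]
    seen = {entry_url}
    catalogs, targets = [], []
    while frontier:
        depth, _, url = heapq.heappop(frontier)
        is_target = TARGET_RE.fullmatch(url) is not None
        is_catalog = CATALOG_RE.fullmatch(url) is not None
        if not (is_target or is_catalog):
            continue  # URL matches neither regular expression
        text, links = PAGES.get(url, ("", []))  # stands in for an HTTP fetch
        hits = keyword_hits(text)
        if hits == 0:
            continue  # page contains none of the keywords
        (targets if is_target else catalogs).append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (depth + 1, -hits, link))
    return catalogs, targets


catalogs, targets = crawl("https://example.com/list")
print(catalogs)
print(targets)
```

In this sample, `/about` is dropped because it matches neither regular expression and `/item/2` is dropped because it contains no keyword, so only the two list pages and items 1 and 3 are retained for label extraction.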
CN201610242327.9A 2016-04-19 2016-04-19 Network data searching method based on crawler mode Pending CN105955984A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610242327.9A CN105955984A (en) 2016-04-19 2016-04-19 Network data searching method based on crawler mode

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610242327.9A CN105955984A (en) 2016-04-19 2016-04-19 Network data searching method based on crawler mode

Publications (1)

Publication Number Publication Date
CN105955984A true CN105955984A (en) 2016-09-21

Family

ID=56917736

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610242327.9A Pending CN105955984A (en) 2016-04-19 2016-04-19 Network data searching method based on crawler mode

Country Status (1)

Country Link
CN (1) CN105955984A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528646A (en) * 2016-10-13 2017-03-22 中国农业大学 Method for acquiring pesticide data and server
CN106547824A (en) * 2016-09-29 2017-03-29 北京奇艺世纪科技有限公司 Crawl path planning method and device
CN107016102A (en) * 2017-04-12 2017-08-04 成都四方伟业软件股份有限公司 Paging configuration method for a big data web crawler
CN107729508A (en) * 2017-10-23 2018-02-23 北京京东金融科技控股有限公司 Information crawler method and apparatus
CN108228623A (en) * 2016-12-14 2018-06-29 北京国双科技有限公司 Data processing method and client device
CN108446356A (en) * 2018-03-12 2018-08-24 上海哔哩哔哩科技有限公司 Data cache method, server and data buffering system
CN108984568A (en) * 2017-06-04 2018-12-11 北京询达数据科技有限公司 Design method for a fully automatic Internet darknet search engine
CN109255063A (en) * 2018-08-01 2019-01-22 宜人恒业科技发展(北京)有限公司 Method and apparatus for crawling web page content
CN109902220A (en) * 2019-02-27 2019-06-18 腾讯科技(深圳)有限公司 Webpage information acquisition methods, device and computer readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101957816A (en) * 2009-07-13 2011-01-26 上海谐宇网络科技有限公司 Webpage metadata automatic extraction method and system based on multi-page comparison
CN103544176A (en) * 2012-07-13 2014-01-29 百度在线网络技术(北京)有限公司 Method and device for generating page structure template corresponding to multiple pages

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ye Qinyong: "Focused crawler based on URL rules and its application", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106547824B (en) * 2016-09-29 2019-11-15 北京奇艺世纪科技有限公司 Crawl path planning method and device
CN106547824A (en) * 2016-09-29 2017-03-29 北京奇艺世纪科技有限公司 Crawl path planning method and device
CN106528646A (en) * 2016-10-13 2017-03-22 中国农业大学 Method for acquiring pesticide data and server
CN108228623B (en) * 2016-12-14 2021-12-24 北京国双科技有限公司 Data processing method and client device
CN108228623A (en) * 2016-12-14 2018-06-29 北京国双科技有限公司 Data processing method and client device
CN107016102A (en) * 2017-04-12 2017-08-04 成都四方伟业软件股份有限公司 Paging configuration method for a big data web crawler
CN107016102B (en) * 2017-04-12 2019-12-03 成都四方伟业软件股份有限公司 Paging configuration method for a big data web crawler
CN108984568A (en) * 2017-06-04 2018-12-11 北京询达数据科技有限公司 Design method for a fully automatic Internet darknet search engine
CN107729508A (en) * 2017-10-23 2018-02-23 北京京东金融科技控股有限公司 Information crawler method and apparatus
CN108446356A (en) * 2018-03-12 2018-08-24 上海哔哩哔哩科技有限公司 Data cache method, server and data buffering system
CN108446356B (en) * 2018-03-12 2023-08-29 上海哔哩哔哩科技有限公司 Data caching method, server and data caching system
CN109255063A (en) * 2018-08-01 2019-01-22 宜人恒业科技发展(北京)有限公司 Method and apparatus for crawling web page content
CN109902220A (en) * 2019-02-27 2019-06-18 腾讯科技(深圳)有限公司 Webpage information acquisition methods, device and computer readable storage medium
CN109902220B (en) * 2019-02-27 2023-11-24 腾讯科技(深圳)有限公司 Webpage information acquisition method, device and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN105955984A (en) Network data searching method based on crawler mode
JP5721818B2 (en) Use of model information group in search
CN100462969C (en) Method for providing and inquiry information for public by interconnection network
US20110282861A1 (en) Extracting higher-order knowledge from structured data
CN103034633B (en) Generate the method and device of the result of page searching summary of extension
US20110264651A1 (en) Large scale entity-specific resource classification
Gentile et al. Unsupervised wrapper induction using linked data
Dou et al. Automatically mining facets for queries from their search results
CN1983255A (en) Internet searching method
US20160103913A1 (en) Method and system for calculating a degree of linkage for webpages
CN103838798A (en) Page classification system and method
CN103530429A (en) Webpage content extracting method
Ujwal et al. Classification-based adaptive web scraper
CN107273483B (en) The access method and system of sparse data
US20120284224A1 (en) Build of website knowledge tables
CN109948015B (en) Meta search list result extraction method and system
Wanjari et al. Automatic news extraction system for Indian online news papers
Qiu et al. Big data linkage for product specification pages
CN102541913A (en) Web-oriented VSM (vector space model) classifier training method, web-oriented OSSP (open resource software page) identifying method and Web-oriented OSS (open resource software) resource extracting method
TWI547888B (en) A method of recording user information and a search method and a server
CN104063506A (en) Method and device for identifying repeated web pages
Thamviset et al. Bottom-up region extractor for semi-structured web pages
US20140279804A1 (en) Jabba-type contextual tagger
Liu et al. Research on adaptive wrapper in deep web data extraction
Sarkar Learning Spark SQL

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160921