CN107943838A - Method and system for automatically acquiring xpath to generate crawler scripts - Google Patents

Method and system for automatically acquiring xpath to generate crawler scripts

Info

Publication number
CN107943838A
Authority
CN
China
Prior art keywords
tag
xpath
web page
crawler
path
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711034452.1A
Other languages
Chinese (zh)
Other versions
CN107943838B (en)
Inventor
姬永杰
陈国强
王长勇
任建新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Great Digital Science And Technology Development Co Ltd
Original Assignee
Beijing Great Digital Science And Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Great Digital Science And Technology Development Co Ltd
Priority to CN201711034452.1A
Publication of CN107943838A
Application granted
Publication of CN107943838B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/31 Programming languages or programming paradigms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and system for automatically acquiring xpath to generate crawler scripts. The method comprises the following steps: (1) open a web page by its url address and traverse all <a> tags in the web page; (2) extract the xpath path corresponding to each <a> tag; (3) place <a> tags with identical xpath paths in one group, then count the number of <a> tags in each group; (4) take one <a> tag from each group and open the web page it links to; (5) for each web page opened in step 4, count the number of <a> tags and the word count in the web page; (6) select the group with the largest word count and the smallest number of <a> tags, and record its corresponding xpath path; (7) based on the Scrapy framework, generate a crawler script from the corresponding xpath path. Based on the Scrapy framework, the method can crawl public information from government websites and automatically resolve the xpath path of the required content in a web page, raising the level of automation of crawler management.

Description

Method and system for automatically acquiring xpath to generate crawler scripts
Technical field
The present invention relates to the technical field of web crawlers, and in particular to a method and system for automatically acquiring xpath to generate crawler scripts. Here, xpath refers to an xpath path.
Background art
As government information disclosure and data openness continue to increase, more and more government information is published on government websites, forming a massive body of public government website information. Existing government websites are built, maintained, and managed by government departments at all levels. Obtaining public information from these websites easily and quickly would bring great value to users.
However, the content of these government websites differs and their page structures vary. When a present-day internet crawler (also called a web crawler, or simply a crawler) crawls a government website, a professional technician must analyze the structure of each web page in order to locate the required content and crawl it. The reason is as follows:
The xpath path of the required content differs from page to page, so the xpath path of the required content must be resolved manually before crawling. This plainly costs a great deal of time and labor; the workload is heavy and the work tedious. Faced with thousands of government websites, this approach is clearly inefficient.
The present invention involves the following technical terms:
1. Crawling: accessing a website and obtaining information from its web pages, thereby collecting web page data.
2. xpath: a language (used by crawlers) for finding information in web pages (especially XML documents), which traverses the elements and attributes of a web page (especially an XML document). For Html, xpath serves as a path language that identifies a particular part of an Html document.
3. Scrapy: a fast, high-level screen scraping and web crawling framework (a crawler framework developed in Python), used to crawl websites and extract structured data from web pages. Scrapy is widely used, for example for data mining, monitoring, and automated testing. In a crawler script based on the Scrapy framework, the most critical step is identifying the xpath path of the required content in the web page, so that the specified web page content can be crawled.
4. Internet crawler: a program or script that automatically crawls web information according to certain rules. There are two main modes:
The first is whole-web crawling, as performed by search engines such as Baidu;
The second is directed crawling aimed at a certain category. Directed crawling means crawling specified web page content (the targeted content of specified web pages).
For directed crawling, however, as noted above, the page layouts of government websites are rather disorderly. To obtain the xpath path of the specified web page content (the required content in the web page), a professional technician, given the url address of the web page, must inspect the page source code and analyze it before arriving at the correct xpath path.
Summary of the invention
In view of the defects in the prior art, the object of the present invention is to provide a method and system for automatically acquiring xpath to generate crawler scripts. Based on the Scrapy framework, the method can crawl public information from government websites and automatically resolve the xpath path of the required content in a web page, raising the level of automation of crawler management.
To achieve the above object, the present invention adopts the following technical solution:
A method for automatically acquiring xpath to generate crawler scripts, comprising the following steps:
Step 1, obtain the url address of a web page, open the web page by the url address, and traverse all <a> tags in the web page;
The <a> tag is used to define a hyperlink;
Step 2, extract the xpath path corresponding to each <a> tag;
Step 3, group the <a> tags by the following principle: tags with identical xpath paths are placed in one group; then count the number of <a> tags in each group;
Step 4, take one <a> tag from each group and open the web page it links to;
Step 5, for each web page opened in step 4, count the number of <a> tags and the word count in the web page;
The word count refers to the word count of the <a> tags;
Step 6, select the group with the largest word count and the smallest number of <a> tags, and record its corresponding xpath path;
Step 7, based on the Scrapy framework, generate a crawler script from the corresponding xpath path.
Further, in the method for automatically acquiring xpath to generate crawler scripts as described above, the web page in step 1 is a web page containing a subject information list.
Further, in the method for automatically acquiring xpath to generate crawler scripts as described above, step 2 specifically comprises:
Step 2.1, obtain the parent tag of the <a> tag through the jsoup package;
Step 2.2, call recursively, that is, obtain the parent tag of each parent tag in turn;
Step 2.3, stop when the parent tag obtained is <html>;
Step 2.4, connect all the parent tags obtained in sequence to get the xpath path of the <a> tag.
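Steps 2.1 to 2.4 can be illustrated with a minimal sketch. It is not the implementation described here: the text uses the jsoup package (java), whereas this sketch assumes Python's standard html.parser and keeps a stack of open tags in place of recursive parent lookups. The names AnchorPathParser and anchor_paths are hypothetical, and the resulting paths carry no positional indices or id predicates, so sibling links share one path.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class AnchorPathParser(HTMLParser):
    """Collects (tag_path, href) for every <a> element in a document."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, <html> first
        self.anchors = []    # list of (path, href) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a":
            # Joining the ancestors from <html> down to <a> mirrors step 2.4.
            path = "/" + "/".join(self.stack)
            self.anchors.append((path, dict(attrs).get("href")))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def anchor_paths(html):
    """Return the (path, href) pair of every <a> tag in the html string."""
    parser = AnchorPathParser()
    parser.feed(html)
    parser.close()
    return parser.anchors
```

Because the stack already holds every ancestor when an <a> tag is reached, a single pass over the page yields the same chain of parent tags that the recursive lookup would collect.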
Further, in the method for automatically acquiring xpath to generate crawler scripts as described above, step 5 specifically comprises: obtain all <a> tags and their corresponding text() content through jsoup, and count the number of <a> tags and the word count of the text() content.
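The counting in step 5 might look like the following sketch, again written with Python's standard html.parser rather than jsoup. Two points are assumptions, not statements from the text: the word count is taken over all visible text of the opened page, and every non-whitespace character counts as one word, which suits Chinese pages where words are not separated by spaces. PageStats and page_stats are hypothetical names.

```python
from html.parser import HTMLParser


class PageStats(HTMLParser):
    """Counts <a> tags and visible text characters in one page."""

    def __init__(self):
        super().__init__()
        self.a_count = 0
        self.char_count = 0
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.a_count += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            # Assumption: every non-whitespace character counts as one word.
            self.char_count += sum(1 for ch in data if not ch.isspace())


def page_stats(html):
    """Return (number of <a> tags, word count) for the html string."""
    parser = PageStats()
    parser.feed(html)
    parser.close()
    return parser.a_count, parser.char_count
```

Script and style content is skipped because it is not text a reader would see, and counting it would inflate the word count of navigation-heavy pages.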
Further, in the method for automatically acquiring xpath to generate crawler scripts as described above, step 7 specifically comprises: send the url address of the web page in step 1 and the xpath path recorded in step 6 to the Scrapy framework to generate the corresponding Scrapy crawler script.
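How the url address and xpath path are handed to Scrapy is not spelled out, so the sketch below assumes one simple possibility: filling a fixed spider template with the two values. The template text, the function generate_spider, and the output field names "title" and "url" are hypothetical. The generated source uses only well-known Scrapy calls (scrapy.Spider, start_urls, response.xpath, response.urljoin).

```python
from string import Template

# Fixed spider skeleton; only the name, url, and xpath vary between scripts.
SPIDER_TEMPLATE = Template('''\
import scrapy


class ${class_name}(scrapy.Spider):
    name = ${name}
    start_urls = [${url}]

    def parse(self, response):
        # Yield the text and link of every <a> matched by the recorded xpath.
        for link in response.xpath(${xpath}):
            yield {
                "title": link.xpath("string(.)").get(),
                "url": response.urljoin(link.xpath("@href").get()),
            }
''')


def generate_spider(name, url, xpath):
    """Return Scrapy spider source code for one (url, xpath) pair.

    Assumption: name is a lowercase identifier such as "gov_news".
    """
    return SPIDER_TEMPLATE.substitute(
        class_name=name.title().replace("_", "") + "Spider",
        name=repr(name),
        url=repr(url),
        xpath=repr(xpath),
    )
```

Passing the values through repr() keeps quotes inside the xpath (for example around an id value) from breaking the generated Python source. The returned string can be written to a .py file in a Scrapy project.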
A system for automatically acquiring xpath to generate crawler scripts, used to implement the above method, comprising:
An <a> tag traversal module, for opening the web page corresponding to the url address and traversing all <a> tags in the web page;
An xpath path generation module, for extracting the xpath path corresponding to each <a> tag;
A tag grouping module, for grouping the <a> tags and counting the number of <a> tags in each group;
A linked web page acquisition module, for opening the linked web page according to an <a> tag;
An information statistics module, for counting the number of <a> tags and the word count in a web page;
An xpath path discrimination module, for determining the xpath path of the group with the largest word count and the smallest number of <a> tags;
A crawler script generation module, for generating a crawler script from the corresponding xpath path.
Further, in the system for automatically acquiring xpath to generate crawler scripts as described above, the crawler script generation module generates the corresponding Scrapy crawler script based on the Scrapy framework, according to the url address and xpath path of the web page.
The beneficial effects of the present invention are as follows: with this method, the xpath path can be obtained automatically from the url address of the web page alone, without the participation of a professional technician.
With this method, the xpath path of the required content in a web page is located automatically and the crawler script is generated, making crawler management simpler and automated. Previously, crawler scripts had to be written by professional technicians, each of whom could maintain only about one hundred crawler scripts per month. With this method, an ordinary person can maintain thousands of crawler scripts, which greatly improves work efficiency.
Brief description of the drawings
Fig. 1 is a flow chart of a method for automatically acquiring xpath to generate crawler scripts provided in a specific embodiment of the invention;
Fig. 2 is a structural diagram of a system for automatically acquiring xpath to generate crawler scripts provided in a specific embodiment of the invention.
Fig. 3 is an example web page of a government website.
Fig. 4 is a schematic diagram of the web pages opened from the <a> tags in the web page shown in Fig. 3.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the drawings and embodiments.
Scrapy is the current mainstream crawler framework. To crawl, it needs the unique Html identifier, that is, the xpath path, of the required content in the web page (the content to be crawled, for example the subject information list). One object of the invention is therefore to automatically identify the xpath path of the subject information list in a web page, so that only the information in the subject information list is crawled and the other information in the web page is filtered out.
The main idea of the method is as follows. Government websites are comparatively uniform: web pages containing public government website information share a certain common structure. A basic page of this kind mostly includes menu navigation, notices and announcements, a subject information list, other navigation, advertisements, and other links. However, the xpath path of the subject information list differs from page to page. The invention provides the following method, which automatically screens out the xpath path of the subject information list, automatically generates a crawler script, and crawls the information in the subject information list.
Fig. 1 shows the flow chart of a method for automatically acquiring xpath to generate crawler scripts provided in a specific embodiment of the invention. The method mainly comprises the following steps:
Step 1, obtain the url address of a web page, open the web page by the url address, and traverse all <a> tags in the web page;
The <a> tag is used to define a hyperlink;
The web page is one that contains a subject information list. The url addresses of such web pages can be collected and organized manually, or preset according to the structure of the different government websites;
The specific algorithm for traversing all <a> tags in the web page is as follows: call the methods of the jsoup package to obtain all <a> tags in the web page and their content. jsoup is a web page parsing package developed in java that finds and extracts content in a document in ways similar to DOM and CSS selectors;
Step 2, extract the xpath path corresponding to each <a> tag;
The specific algorithm is as follows:
Step 2.1, obtain the parent tag of the <a> tag through the jsoup package;
Step 2.2, call recursively, that is, obtain the parent tag of each parent tag in turn;
Step 2.3, stop when the parent tag obtained is <html>;
Step 2.4, connect all the parent tags obtained in sequence to get the xpath path of the <a> tag;
Step 3, group the <a> tags by the following principle: tags with identical xpath paths are placed in one group, that is, <a> tags whose xpath paths are identical form one group; then count the number of <a> tags in each group;
Step 4, take one <a> tag from each group and open the web page it links to;
Because of the grouping in step 3, the linked web pages opened from the <a> tags within any one group are all alike;
Step 5, for each web page opened in step 4, count the number of <a> tags and the word count in the web page;
The word count refers to the word count of the <a> tags;
The specific algorithm is as follows: obtain all <a> tags and their corresponding text() content through jsoup, and count the number of <a> tags and the word count of the text() content;
The number of <a> tags is the count of <a> tags;
The word count is the number of words of text;
Step 6, select the group with the largest word count and the smallest number of <a> tags, and record its corresponding xpath path;
Step 7, based on the Scrapy framework, generate a crawler script from the corresponding xpath path. The crawler script refers to a Scrapy crawler script;
The specific algorithm is as follows: send the url address of the web page in step 1 and the xpath path recorded in step 6 to the Scrapy framework to generate the corresponding Scrapy crawler script;
A Scrapy crawler script needs only the url address and xpath path of the web page; the rest of its content is essentially fixed. Therefore, once the url and xpath are obtained, the Scrapy crawler script can be generated through the Scrapy framework.
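The grouping in step 3 and the choice of one link per group in step 4 can be sketched as follows. The input format, a list of (xpath, href) pairs, and the function name group_by_xpath are assumptions made for illustration. Taking the first link of each group as the one to open is likewise only one possible choice, since the steps above do not say which <a> tag of a group to take.

```python
from collections import defaultdict


def group_by_xpath(anchors):
    """Group (xpath, href) pairs by xpath, as in step 3.

    Returns {xpath: {"count": n, "sample": href}}, where the sample is
    the single link per group that step 4 would open.
    """
    groups = defaultdict(lambda: {"count": 0, "sample": None})
    for xpath, href in anchors:
        group = groups[xpath]
        group["count"] += 1
        if group["sample"] is None:
            group["sample"] = href   # keep the first usable link
    return dict(groups)


# Hypothetical input: two links under one path, one under another.
anchors = [
    ("/html/body/ul/li/a", "/a"),
    ("/html/body/ul/li/a", "/b"),
    ("/html/body/p/a", "/c"),
]
for xpath, info in group_by_xpath(anchors).items():
    print(xpath, info["count"], info["sample"])
```

Opening only one sample page per group keeps the number of page loads equal to the number of distinct xpath paths, not the number of links on the page.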
Verified on more than one hundred government web pages, the method of the invention was able to open 80% of the web pages containing public government website information, automatically analyze them, and obtain the xpath path of the subject information list, thereby generating crawler scripts. The method takes about 1 minute on average, which shows that it is feasible and efficient.
Admittedly, some websites have unusual structures for which the method cannot obtain the xpath path with complete accuracy. In such cases, the algorithms of steps 5 and 6 can be refined according to the specific situation, by summarizing the structural characteristics of these websites, so that the xpath path is obtained accurately. For example, some web pages have only a few list links and comparatively many other links. Such special cases are not described in detail here.
A specific embodiment follows.
As shown in Fig. 3, the example is a web page of a government website, in which the following are visible:
The menu navigation at the top,
Other navigation on the left,
Advertisements and other links at the bottom,
The subject information list in the middle.
The url address of the web page is known.
The specific steps are as follows:
Step 1, after opening the web page, traverse all <a> tags in the web page;
The web page contains the following <a> tags:
<a href="/defaults/news/news/nid/5701" title="Beijing Second Foreign Language Institute: bid award announcement for the government procurement project to build the institute's five-in-one multifunctional training laboratory">
<a href="/defaults/news/news/nid/5693" title="Beijing prison reconstruction and expansion, security system construction phase, government procurement project: bid award announcement for the purchase of special equipment for the video surveillance system">
<a href="/defaults/news/news/nid/5692" title="Bid award announcement for the government procurement project for 2017-2019 annual final accounts audit, consultation, and quality inspection of enterprises funded by the State-owned Assets Supervision and Administration Commission of the Beijing Municipal People's Government">
(there are too many tags to list them all)
Step 2, extract the xpath path corresponding to each <a> tag. The xpath paths of the <a> tags are as follows:
//*[@id="nav_right"]/ul[2]/li[1]/a
//*[@id="newslist"]/ul/li[1]/span[1]/a
//*[@id="newstype_list"]/dl/dt[1]/a
//*[@id="footer"]/div[2]/ul/li[1]/a
Step 3, group the <a> tags so that tags with identical xpath paths fall into one group. The grouping result is shown in the following table:
Web page content | Xpath path | Number of <a> tags
Menu navigation | //*[@id="nav_right"]/ul[2]/li[1]/a | 6
Subject information list | //*[@id="newslist"]/ul/li[1]/span[1]/a | 15
Other navigation | //*[@id="newstype_list"]/dl/dt[1]/a | 5
Advertisements and other links | //*[@id="footer"]/div[2]/ul/li[1]/a | 5
Step 4, take one <a> tag from each group and open the web page it links to. This embodiment has four groups, and the four corresponding web pages are shown in Fig. 4;
Step 5, for each web page opened in step 4, count the number of <a> tags and the word count in the web page. The statistics are shown in the following table:
Web page content | Number of <a> tags | Web page word count
Menu navigation | 42 | 222
Subject information list | 33 | 1607
Other navigation | 59 | 836
Advertisements and other links | 44 | 283
Step 6, select the group with the largest word count and the smallest number of <a> tags, and record its corresponding xpath path;
According to the table above, the group that satisfies "largest word count and smallest number of <a> tags" is the one whose xpath path corresponds to the subject information list;
Step 7, generate the crawler script from the xpath path corresponding to the subject information list.
As can be seen, the xpath path of the subject information list is determined from the web page word count and the number of <a> tags for each group of <a> tags. No professional technician is needed for the identification, and an ordinary person can carry out crawler management.
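The selection in step 6 can be checked against the figures of this embodiment with a short sketch. The text asks for the group with the largest word count and the smallest number of <a> tags but does not say how to combine the two criteria when they disagree. The sketch assumes one possible rule, the highest word count per <a> tag. The function name pick_list_group is hypothetical.

```python
def pick_list_group(stats):
    """Pick the group whose sample page has the most text and fewest links.

    stats maps a group label to (a_count, word_count) for its opened page.
    Assumption: the two criteria are combined as word count per <a> tag.
    """
    def words_per_link(item):
        a_count, word_count = item[1]
        return word_count / max(a_count, 1)   # avoid division by zero

    return max(stats.items(), key=words_per_link)[0]


# Figures from the step 5 table of this embodiment.
stats = {
    "menu navigation": (42, 222),
    "subject information list": (33, 1607),
    "other navigation": (59, 836),
    "advertisements and other links": (44, 283),
}
print(pick_list_group(stats))   # prints "subject information list"
```

In this embodiment both criteria agree, since the subject information list page has the largest word count (1607) and the smallest number of <a> tags (33), so any reasonable combination rule selects the same group.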
Corresponding to the method shown in Fig. 1, an embodiment of the present invention also provides a system for automatically acquiring xpath to generate crawler scripts. As shown in Fig. 2, the system comprises:
An <a> tag traversal module, for opening the web page corresponding to the url address and traversing all <a> tags in the web page;
An xpath path generation module, for extracting the xpath path corresponding to each <a> tag;
A tag grouping module, for grouping the <a> tags and counting the number of <a> tags in each group;
A linked web page acquisition module, for opening the linked web page according to an <a> tag;
An information statistics module, for counting the number of <a> tags and the word count in a web page;
An xpath path discrimination module, for determining the xpath path of the group with the largest word count and the smallest number of <a> tags;
A crawler script generation module, for generating a crawler script from the corresponding xpath path.
On the basis of the above technical solution, the crawler script generation module generates the corresponding Scrapy crawler script based on the Scrapy framework, according to the url address and xpath path of the web page.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if such changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.

Claims (7)

1. A method for automatically acquiring xpath to generate crawler scripts, comprising the following steps:
Step 1, obtain the url address of a web page, open the web page by the url address, and traverse all <a> tags in the web page;
The <a> tag is used to define a hyperlink;
Step 2, extract the xpath path corresponding to each <a> tag;
Step 3, group the <a> tags by the following principle: tags with identical xpath paths are placed in one group; then count the number of <a> tags in each group;
Step 4, take one <a> tag from each group and open the web page it links to;
Step 5, for each web page opened in step 4, count the number of <a> tags and the word count in the web page;
The word count refers to the word count of the <a> tags;
Step 6, select the group with the largest word count and the smallest number of <a> tags, and record its corresponding xpath path;
Step 7, based on the Scrapy framework, generate a crawler script from the corresponding xpath path.
2. The method for automatically acquiring xpath to generate crawler scripts according to claim 1, characterized in that: in step 1, the web page is a web page containing a subject information list.
3. The method for automatically acquiring xpath to generate crawler scripts according to claim 1, characterized in that step 2 specifically comprises:
Step 2.1, obtain the parent tag of the <a> tag through the jsoup package;
Step 2.2, call recursively, that is, obtain the parent tag of each parent tag in turn;
Step 2.3, stop when the parent tag obtained is <html>;
Step 2.4, connect all the parent tags obtained in sequence to get the xpath path of the <a> tag.
4. The method for automatically acquiring xpath to generate crawler scripts according to claim 1, characterized in that step 5 specifically comprises: obtain all <a> tags and their corresponding text() content through jsoup, and count the number of <a> tags and the word count of the text() content.
5. The method for automatically acquiring xpath to generate crawler scripts according to claim 1, characterized in that step 7 specifically comprises: send the url address of the web page in step 1 and the xpath path recorded in step 6 to the Scrapy framework to generate the corresponding Scrapy crawler script.
6. A system for automatically acquiring xpath to generate crawler scripts, comprising:
An <a> tag traversal module, for opening the web page corresponding to the url address and traversing all <a> tags in the web page;
An xpath path generation module, for extracting the xpath path corresponding to each <a> tag;
A tag grouping module, for grouping the <a> tags and counting the number of <a> tags in each group;
A linked web page acquisition module, for opening the linked web page according to an <a> tag;
An information statistics module, for counting the number of <a> tags and the word count in a web page;
An xpath path discrimination module, for determining the xpath path of the group with the largest word count and the smallest number of <a> tags;
A crawler script generation module, for generating a crawler script from the corresponding xpath path.
7. The system for automatically acquiring xpath to generate crawler scripts according to claim 6, characterized in that: the crawler script generation module generates the corresponding Scrapy crawler script based on the Scrapy framework, according to the url address and xpath path of the web page.
CN201711034452.1A 2017-10-30 2017-10-30 Method and system for automatically acquiring xpath generated crawler script Active CN107943838B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711034452.1A CN107943838B (en) 2017-10-30 2017-10-30 Method and system for automatically acquiring xpath generated crawler script

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711034452.1A CN107943838B (en) 2017-10-30 2017-10-30 Method and system for automatically acquiring xpath generated crawler script

Publications (2)

Publication Number Publication Date
CN107943838A CN107943838A (en) 2018-04-20
CN107943838B CN107943838B (en) 2021-09-07

Family

ID=61936673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711034452.1A Active CN107943838B (en) 2017-10-30 2017-10-30 Method and system for automatically acquiring xpath generated crawler script

Country Status (1)

Country Link
CN (1) CN107943838B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073730A (en) * 2011-01-14 2011-05-25 哈尔滨工程大学 Method for constructing topic web crawler system
CN103020156A (en) * 2012-11-23 2013-04-03 北京小米科技有限责任公司 Processing method, device and equipment for webpage
CN104598462A (en) * 2013-10-30 2015-05-06 深圳市国信互联科技有限公司 Method and device for extracting structural data
CN103778238A (en) * 2014-01-27 2014-05-07 西安交通大学 Method for automatically building classification tree from semi-structured data of Wikipedia
CN104090931A (en) * 2014-06-25 2014-10-08 华南理工大学 Information prediction and acquisition method based on webpage link parameter analysis
CN104142985A (en) * 2014-07-23 2014-11-12 哈尔滨工业大学(威海) Semi-automatic vertical crawler generation tool and method
CN104360882A (en) * 2014-11-07 2015-02-18 北京奇虎科技有限公司 Method and device for displaying images in web page in browser
CN105677764A (en) * 2015-12-30 2016-06-15 百度在线网络技术(北京)有限公司 Information extraction method and device
CN107016102A (en) * 2017-04-12 2017-08-04 成都四方伟业软件股份有限公司 A kind of big data web crawlers paging collocation method
CN107066576A (en) * 2017-04-12 2017-08-18 成都四方伟业软件股份有限公司 A kind of big data web crawlers paging system of selection and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
COSMOKEY: "A small tool for automatically generating Xpath", 《CSDN》 *
肖恩部落: "Using Xpath to get the tags of an html document", 《博客园》 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109246069B (en) * 2018-06-15 2020-10-16 华为技术有限公司 Webpage login method and device and readable storage medium
CN109246069A (en) * 2018-06-15 2019-01-18 华为技术有限公司 Webpage login method, device and readable storage medium storing program for executing
CN109657117A (en) * 2018-11-12 2019-04-19 厦门市美亚柏科信息股份有限公司 A kind of extraction method, system and the computer storage medium of webpage element
CN110147476A (en) * 2019-04-12 2019-08-20 深圳壹账通智能科技有限公司 Data crawling method, terminal device and computer readable storage medium based on Scrapy
CN111444407A (en) * 2020-03-26 2020-07-24 桂林理工大学 Automatic extraction method and system for page list information of web crawler
CN111460259B (en) * 2020-03-31 2023-04-14 腾讯科技(深圳)有限公司 Method and device for determining similar elements, computer equipment and storage medium
CN111460259A (en) * 2020-03-31 2020-07-28 腾讯科技(深圳)有限公司 Method and device for determining similar elements, computer equipment and storage medium
CN111831874A (en) * 2020-07-16 2020-10-27 平安国际智慧城市科技股份有限公司 Webpage data information acquisition method and device, computer equipment and storage medium
CN111831874B (en) * 2020-07-16 2022-08-19 深圳赛安特技术服务有限公司 Webpage data information acquisition method and device, computer equipment and storage medium
CN112099778A (en) * 2020-11-13 2020-12-18 北京智慧星光信息技术有限公司 Data acquisition method based on xpath, electronic equipment and storage medium
CN112417252A (en) * 2020-12-04 2021-02-26 天津开心生活科技有限公司 Crawler path determination method and device, storage medium and electronic equipment
CN112417252B (en) * 2020-12-04 2023-05-09 天津开心生活科技有限公司 Crawler path determination method and device, storage medium and electronic equipment
CN114201971A (en) * 2021-12-13 2022-03-18 海南港航控股有限公司 Method and system for extracting character attributes from webpage

Also Published As

Publication number Publication date
CN107943838B (en) 2021-09-07

Similar Documents

Publication Publication Date Title
CN107943838A (en) A method and system for automatically obtaining xpath to generate crawler scripts
CN102054016B (en) System and method for capturing and managing community intelligence information
Sekine: "On-demand information extraction"
CN105893583A (en) Data acquisition method and system based on artificial intelligence
CN103902889A (en) Malicious message cloud detection method and server
CN103605738A (en) Webpage access data statistical method and webpage access data statistical device
CN105335246B (en) A program crash defect self-repair method based on analysis of question-and-answer websites
CN114462556B (en) Enterprise association industry chain classification method, training method, device, equipment and medium
CN107748782A (en) Query statement processing method and processing device
CN108733813A (en) Information extraction method, system and medium for BBS forum web page content
CN114860882A (en) Fair competition review auxiliary method based on text classification model
CN114648393A (en) Data mining method, system and equipment applied to bidding
CN107526833B (en) URL management method and system
CN112948664A (en) Method and system for automatically processing sensitive words
US8799791B2 (en) System for use in editorial review of stored information
CN109522494A (en) A hidden link detection method, device, equipment and computer readable storage medium
CN112822210A (en) Vulnerability management system based on network assets
CN116562785B (en) Auditing and welcome system
CN109063485B (en) Vulnerability classification statistical system and method based on vulnerability platform
CN110472125B (en) Multistage page cascading crawling method and equipment based on web crawler
CN107239704A (en) Malicious web page discovery method and device
CN114519163B (en) Incremental news URL extraction method based on regular matching and Bloom filter
CN111241141B (en) Rapid screening method for vehicle purchase tax monitoring management platform problem enterprises
CN113590597B (en) Identification method and equipment for analysis hierarchical division of key personnel of network abnormal behaviors
CN112398864B (en) Vertical web crawler detection and identification method based on behavior balance degree

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Ji Yongjie, Chen Guoqiang, Ren Jianxin

Inventor before: Ji Yongjie, Chen Guoqiang, Wang Changyong, Ren Jianxin

GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A method and system for automatically obtaining xpath to generate crawler scripts

Effective date of registration: 20231114

Granted publication date: 20210907

Pledgee: Beijing first financing Company limited by guarantee

Pledgor: BEIJING DASY TECHNOLOGY DEVELOPMENT CO.,LTD.

Registration number: Y2023110000472