CN107943838B - Method and system for automatically acquiring xpath generated crawler script - Google Patents

Info

Publication number
CN107943838B
Authority
CN
China
Prior art keywords
webpage
tags
xpath
script
crawler
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711034452.1A
Other languages
Chinese (zh)
Other versions
CN107943838A (en)
Inventor
姬永杰
陈国强
任建新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dasy Technology Development Co ltd
Original Assignee
Beijing Dasy Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dasy Technology Development Co ltd
Priority to CN201711034452.1A
Publication of CN107943838A
Application granted
Publication of CN107943838B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/31 Programming languages or programming paradigms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a system for automatically acquiring an xpath path and generating a crawler script, wherein the method comprises the following steps: (1) opening a webpage through its url address and traversing all <a> tags in the webpage; (2) taking out the xpath path corresponding to each <a> tag; (3) grouping the <a> tags so that tags with the same xpath path fall into one group, then counting the number of <a> tags in each group; (4) taking one <a> tag from each group and opening the webpage it links to; (5) for each webpage opened in step 4, counting the number of <a> tags and the number of characters in the webpage; (6) selecting the group with the most characters and the fewest <a> tags, and recording its xpath path; (7) generating a crawler script from that xpath path based on the Scrapy framework. The method can be used with the Scrapy framework to crawl the public information of government websites, automatically works out the xpath path of the required content in a webpage, and raises the level of automation of crawler management.

Description

Method and system for automatically acquiring xpath generated crawler script
Technical Field
The invention relates to the technical field of web crawlers, and in particular to a method and a system for automatically acquiring an xpath path and generating a crawler script from it. Here "xpath" refers to an xpath path.
Background
As government information disclosure and open-data efforts increase, more and more government information is published on government websites, forming a huge body of government website public information. Government websites are established, maintained and managed by the individual departments of each government, and being able to obtain government website public information from them quickly and conveniently is of great value to users.
However, these government websites differ in content and in web page structure, so crawling them requires professional technicians to analyze the structure of each web page in order to locate and crawl the required content, because:
the xpath path of the required content differs from web page to web page and has to be analyzed manually before crawling, which costs a great deal of time and labor and makes the work heavy and tedious. Faced with thousands of government websites, this approach is clearly inefficient.
The present invention involves the following technical terms:
1. Crawling: accessing a website and acquiring information from its web pages, that is, web page data acquisition.
2. xpath: a language for finding information in a web page (more precisely, in an XML document), used to traverse the elements and attributes of the document. xpath is the XML Path Language, a language for determining the location of a part of an XML document; it is applied here to Html documents.
3. Scrapy: a fast, high-level screen-scraping and web-crawling framework (crawler framework) written in Python, used to crawl websites and extract structured data from web pages. Scrapy has a wide range of uses, including data mining, monitoring and automated testing. In a crawler script based on the Scrapy framework, the most critical step is identifying the xpath path of the required content in the webpage so that the specified webpage content can be crawled.
4. Internet crawler: a program or script that automatically captures World Wide Web information according to certain rules. There are two main modes:
the first is whole-web crawling, as done by search engines such as Baidu;
the second is directional crawling for a certain category, which means crawling specified webpage content (the targeted content of a specified web page).
However, for directional crawling, as described above, the page layouts of government websites are rather disorderly, so obtaining the xpath path of the specified webpage content (the required content in the webpage) requires a professional who, given the url address of the webpage, inspects the page source code and works out the correct xpath path by analysis.
Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide a method and a system for automatically acquiring an xpath path and generating a crawler script.
To achieve this purpose, the technical scheme adopted by the invention is as follows:
a method for automatically acquiring an xpath path and generating a crawler script comprises the following steps:
step 1, acquiring the url address of a webpage, opening the webpage through the url address, and traversing all <a> tags in the webpage;
the <a> tag is used to define a hyperlink;
step 2, taking out the xpath path corresponding to each <a> tag;
step 3, grouping the <a> tags according to the following principle: <a> tags with the same xpath path are placed in one group; then counting the number of <a> tags in each group;
step 4, taking one <a> tag from each group and opening the webpage it links to;
step 5, for each webpage opened in step 4, counting the number of <a> tags and the number of characters in the webpage;
the character count refers to the number of characters of the <a> tags;
step 6, selecting the group with the most characters and the fewest <a> tags, and recording its xpath path;
step 7, generating a crawler script from the recorded xpath path based on the Scrapy framework.
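Step 1 of the list above (traversing all <a> tags) can be sketched as follows. This is a minimal illustration, not the patented implementation: the description names the Java package jsoup for this step, whereas the sketch swaps in Python's standard html.parser so that it runs without extra dependencies. The names AnchorCollector and collect_anchors are hypothetical.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, text) for every <a> element, in document order."""

    def __init__(self):
        super().__init__()
        self.anchors = []        # finished (href, text) pairs
        self._in_anchor = False  # True while an <a> is open
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._in_anchor = False


def collect_anchors(html):
    """Hypothetical helper: return every <a> tag of a page as (href, text)."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.anchors
```

Fetching the page from its url address is left out on purpose; any HTTP client could supply the html string.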
Further, in the method described above, the webpage in step 1 is a webpage containing a subject information list.
Further, in the method described above, the specific steps of step 2 are as follows:
step 2.1, acquiring the parent tag of the <a> tag through the jsoup package;
step 2.2, recursively acquiring the parent tag of each parent tag;
step 2.3, stopping when the acquired parent tag is <html>;
step 2.4, connecting all the acquired parent tags in sequence to obtain the xpath path of the <a> tag.
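Steps 2.1 to 2.4 can be sketched as a recursive walk over parent links. The sketch below is an assumption-laden illustration: it builds its own small tree with Python's html.parser instead of jsoup, and it joins plain tag names only (for example /html/body/ul/li/a), without the indexes or id predicates that appear in the embodiment's paths. Node, TreeBuilder and xpath_of are hypothetical names.

```python
from html.parser import HTMLParser

# Elements that never have children or an end tag
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent


class TreeBuilder(HTMLParser):
    """Records parent links for every element and remembers each <a> node."""

    def __init__(self):
        super().__init__()
        self.current = None
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        if tag == "a":
            self.anchors.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element that has this tag name
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None:
            self.current = node.parent


def ancestor_tags(node):
    """Steps 2.1-2.3: recurse through parents until <html> is reached."""
    if node is None:
        return []
    if node.tag == "html":
        return ["html"]
    return ancestor_tags(node.parent) + [node.tag]


def xpath_of(node):
    """Step 2.4: connect the collected tags in sequence."""
    return "/" + "/".join(ancestor_tags(node))
```

Because the path carries no positional index, all links of one list produce the same string, which is what the grouping in step 3 relies on.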
Further, in the method described above, the specific step of step 5 is as follows: all <a> tags and their text() content are acquired through jsoup, and the number of <a> tags and the number of characters of the text() content are counted.
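The counting in step 5 can be sketched as below, again with Python's html.parser standing in for jsoup. The text defines the character count over the text() content of the <a> tags, while the embodiment's table labels the column "number of web page characters"; since the two readings differ, this hypothetical PageStats class reports both and leaves the choice to the caller. Whitespace at the edges of each text run is not counted, which is also an assumption.

```python
from html.parser import HTMLParser


class PageStats(HTMLParser):
    """Counts <a> tags, characters inside <a> tags, and characters overall."""

    def __init__(self):
        super().__init__()
        self.anchor_count = 0
        self.anchor_chars = 0   # characters of text inside <a> tags
        self.page_chars = 0     # characters of all text on the page
        self._open_anchors = 0  # > 0 while inside an <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_count += 1
            self._open_anchors += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._open_anchors:
            self._open_anchors -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.page_chars += n
        if self._open_anchors:
            self.anchor_chars += n


def page_stats(html):
    """Hypothetical helper returning (anchor_count, anchor_chars, page_chars)."""
    parser = PageStats()
    parser.feed(html)
    parser.close()
    return parser.anchor_count, parser.anchor_chars, parser.page_chars
```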
Further, in the method described above, the specific step of step 7 is as follows: the url address of the webpage in step 1 and the xpath path recorded in step 6 are sent to the Scrapy framework to generate the corresponding Scrapy crawler script.
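One way to read step 7 is as filling a fixed spider template with the two variable values. The sketch below is an assumption about how such a template could look, not the script format used by the inventors: the spider name, the class name ListSpider and the yielded field names are hypothetical, while scrapy.Spider, start_urls, response.xpath and response.urljoin are standard Scrapy calls. The generator only produces source text, so Scrapy does not need to be installed to run it.

```python
from string import Template

# Fixed part of the spider; only name, url and xpath vary between sites
SPIDER_TEMPLATE = Template('''\
import scrapy


class ListSpider(scrapy.Spider):
    name = $name
    start_urls = [$url]

    def parse(self, response):
        # Each match is one <a> tag of the subject information list
        for link in response.xpath($xpath):
            yield {
                "title": link.xpath("string(.)").get(default="").strip(),
                "url": response.urljoin(link.xpath("@href").get(default="")),
            }
''')


def generate_spider(url, xpath, name="list_spider"):
    """Return Scrapy spider source code for one url address and xpath path."""
    # repr() yields a valid Python string literal even if the xpath has quotes
    return SPIDER_TEMPLATE.substitute(
        name=repr(name), url=repr(url), xpath=repr(xpath)
    )
```

The returned text could be written to a .py file inside a Scrapy project and started with the usual scrapy command line.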
A system for automatically acquiring an xpath path and generating a crawler script, which implements the method above, comprises:
a tag traversal module, configured to open the webpage corresponding to the url address and traverse all <a> tags in the webpage;
an xpath path generation module, configured to take out the xpath path corresponding to each <a> tag;
a tag grouping module, configured to group the <a> tags and count the number of <a> tags in each group;
a linked webpage acquisition module, configured to open the linked webpage according to an <a> tag;
an information statistics module, configured to count the number of <a> tags and the number of characters in a webpage;
an xpath path judging module, configured to determine the xpath path corresponding to the group with the most characters and the fewest <a> tags;
and a crawler script generation module, configured to generate a crawler script according to the corresponding xpath path.
Further, in the system described above, the crawler script generation module generates the corresponding Scrapy crawler script from the url address and the xpath path of the webpage, based on the Scrapy framework.
The beneficial effects of the invention are as follows: with this method, the xpath path can be acquired automatically from nothing more than the url address of the webpage, without the participation of professional technicians.
The method automatically locates the xpath path of the required content in the webpage and generates the crawler script, which simplifies and automates crawler management. Previously, technical personnel had to write the crawler scripts, and one technician could maintain only roughly a hundred crawler scripts a month; with this method, an ordinary person can maintain thousands of crawler scripts, which greatly improves work efficiency.
Drawings
FIG. 1 is a flowchart of a method for automatically obtaining an xpath-generated crawler script according to an embodiment of the present invention;
FIG. 2 is a block diagram of a system for automatically obtaining an xpath-generated crawler script according to an embodiment of the present invention;
FIG. 3 is an example of a government website web page;
FIG. 4 is a schematic view of the web pages opened from the <a> tags in the web page shown in FIG. 3.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
Scrapy is a currently mainstream crawler framework. To crawl, it needs the unique Html locator, i.e. the xpath path, of the content in the webpage (the content to be crawled, such as a subject information list). One purpose of the invention is therefore to identify automatically the xpath path of the subject information list in a webpage, so that only the information in the subject information list is crawled and the other information in the webpage is filtered out.
The method rests mainly on the following observation: web pages that carry government website public information have a fairly uniform structure. Such a page mostly consists of: menu navigation, notification announcements, a subject information list, other navigation, advertisements and other links, and so on. However, the xpath path of the subject information list differs from web page to web page. The method provided by the invention automatically picks out the xpath path of the subject information list, automatically generates a crawler script, and crawls the information in the subject information list.
FIG. 1 is a flowchart of a method for automatically obtaining an xpath-generated crawler script according to an embodiment of the present invention. The method mainly includes the following steps:
step 1, acquiring the url address of a webpage, opening the webpage through the url address, and traversing all <a> tags in the webpage;
the <a> tag is used to define a hyperlink;
the webpage is a webpage containing a subject information list; the url addresses of such webpages can be collected and sorted manually, or preset according to the architectures of the different government websites;
the specific algorithm for traversing all <a> tags in a webpage is as follows: calling a method of the jsoup package to acquire all <a> tags in the webpage together with their contents; jsoup is a package for parsing web pages, written in Java, which provides DOM-like and CSS-selector-like ways of finding and extracting content in a document;
step 2, taking out the xpath path corresponding to each <a> tag;
the specific algorithm is as follows:
step 2.1, acquiring the parent tag of the <a> tag through the jsoup package;
step 2.2, recursively acquiring the parent tag of each parent tag;
step 2.3, stopping when the acquired parent tag is <html>;
step 2.4, connecting all the acquired parent tags in sequence to obtain the xpath path of the <a> tag;
step 3, grouping the <a> tags according to the following principle: <a> tags with the same xpath path are placed in one group; then counting the number of <a> tags in each group;
step 4, taking one <a> tag from each group and opening the webpage it links to;
because of the grouping in step 3, the linked webpages opened from any <a> tag in a group are the same, so one tag per group is enough;
step 5, for each webpage opened in step 4, counting the number of <a> tags and the number of characters in the webpage;
the character count refers to the number of characters of the <a> tags;
the specific algorithm is as follows: acquiring all <a> tags and their text() content through jsoup, and counting the number of <a> tags and the number of characters of the text() content;
the number of <a> tags is the total count of <a> tags in the webpage;
the number of characters is the total count of characters of the text() content;
step 6, selecting the group with the most characters and the fewest <a> tags, and recording its xpath path;
step 7, generating a crawler script from the recorded xpath path based on the Scrapy framework, the crawler script being a Scrapy crawler script;
the specific algorithm is as follows: sending the url address of the webpage in step 1 and the xpath path recorded in step 6 to the Scrapy framework to generate the corresponding Scrapy crawler script;
a Scrapy crawler script needs only the url address and the xpath path of the webpage, the rest of its content being essentially fixed, so once the url and the xpath are obtained the Scrapy crawler script can be generated through the Scrapy framework.
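Steps 3 and 4 of the list above (grouping by xpath path, counting each group, and opening one link per group) can be sketched as follows. The input format, a list of (xpath, href) pairs, and the function names are hypothetical. Resolving each href against the url address of the page with urljoin is an added assumption, made because the hrefs in the embodiment are relative.

```python
from collections import defaultdict
from urllib.parse import urljoin


def group_by_xpath(anchors):
    """Step 3: place <a> tags with the same xpath path in one group.

    anchors is an iterable of (xpath, href) pairs; the result maps each
    xpath path to the hrefs of its group, in document order.
    """
    groups = defaultdict(list)
    for xpath, href in anchors:
        groups[xpath].append(href)
    return dict(groups)


def representatives(groups, page_url):
    """Step 4: per group, the tag count and one absolute link to open."""
    return {
        xpath: (len(hrefs), urljoin(page_url, hrefs[0]))
        for xpath, hrefs in groups.items()
    }
```

Opening only the first link of each group keeps the number of page fetches equal to the number of groups rather than the number of links.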
Verified on hundreds of government web pages, the method automatically obtained the xpath path of the subject information list for 80% of the web pages containing government website public information and generated a crawler script from it, taking about 1 minute on average, which shows the method to be feasible and efficient.
However, the method does not cover the special architectures of some websites. For such sites, the algorithms of steps 5 and 6 can be adapted to the specific case by summarizing the characteristics of the site architecture, so that the xpath path is still acquired accurately. For example, on some web pages the list contains only a few links while the other links are numerous; such special cases are outside the scope of the present invention and are not described in detail.
One embodiment is as follows.
FIG. 3 shows an example web page of a government website. The page contains:
menu navigation at the top,
other navigation on the left,
advertisements and other links at the bottom,
a subject information list in the middle.
The url address of the web page is known.
The specific steps are as follows:
step 1, after the webpage is opened, traversing all <a> tags in the webpage;
the webpage contains the following <a> tags:
<a href=default/news/news/news/nid/5701>
<a href="/defaults/news/news/nid/5693" title="establishing a period government procurement project video monitoring system special equipment procurement bid notice">
<a href=default/news/news/news/nid/5692>
(there are too many tags to list them all)
Step 2, taking out the xpath path corresponding to each <a> tag; the xpath paths obtained are as follows:
//*[@id="nav_right"]/ul[2]/li[1]/a
//*[@id="newslist"]/ul/li[1]/span[1]/a
//*[@id="newstype_list"]/dl/dt[1]/a
//*[@id="footer"]/div[2]/ul/li[1]/a
step 3, grouping the <a> tags by identical xpath path; the grouping results are as follows:
Web page content                  Xpath path                                 Number of <a> tags
Menu navigation                   //*[@id="nav_right"]/ul[2]/li[1]/a         6
Subject information list          //*[@id="newslist"]/ul/li[1]/span[1]/a     15
Other navigations                 //*[@id="newstype_list"]/dl/dt[1]/a        5
Advertisements and other links    //*[@id="footer"]/div[2]/ul/li[1]/a        5
Step 4, taking one <a> tag from each group and opening the webpage it links to; this embodiment has four groups, and the four corresponding webpages are shown in FIG. 4;
step 5, for each webpage opened in step 4, counting the number of <a> tags and the number of characters in the webpage; the statistical results are as follows:
Web page content                  Number of <a> tags    Number of web page characters
Menu navigation                   42                    222
Subject information list          33                    1607
Other navigations                 59                    836
Advertisements and other links    44                    283
Step 6, selecting the group with the most characters and the fewest <a> tags, and recording its xpath path;
according to the table, the subject information list meets the condition, with the most characters (1607) and the fewest <a> tags (33), so its path is the xpath path to record;
step 7, generating a crawler script from the xpath path corresponding to the subject information list.
In this way the xpath path of the subject information list is determined from the number of webpage characters and the number of <a> tags of each group, so that crawler management can be carried out by ordinary personnel, with no professional technician needed to identify the path.
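The selection in step 6 can be checked against the figures of this embodiment. The text does not say how to rank groups when "most characters" and "fewest <a> tags" disagree, so the ordering used in this sketch (most characters first, fewer tags only as a tie-breaker) is an assumption, and pick_group is a hypothetical name.

```python
def pick_group(stats):
    """stats maps a group name to (a_tag_count, char_count).

    Assumed ranking: most characters wins; fewer <a> tags breaks ties.
    """
    return max(stats, key=lambda g: (stats[g][1], -stats[g][0]))


# Figures taken from the statistics table of the embodiment
stats = {
    "Menu navigation": (42, 222),
    "Subject information list": (33, 1607),
    "Other navigations": (59, 836),
    "Advertisements and other links": (44, 283),
}

print(pick_group(stats))  # Subject information list
```

In this example the two criteria agree, so any reasonable ranking selects the same group.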
Corresponding to the method shown in FIG. 1, an embodiment of the present invention further provides a system for automatically acquiring an xpath path and generating a crawler script, as shown in FIG. 2. The system includes:
a tag traversal module, configured to open the webpage corresponding to the url address and traverse all <a> tags in the webpage;
an xpath path generation module, configured to take out the xpath path corresponding to each <a> tag;
a tag grouping module, configured to group the <a> tags and count the number of <a> tags in each group;
a linked webpage acquisition module, configured to open the linked webpage according to an <a> tag;
an information statistics module, configured to count the number of <a> tags and the number of characters in a webpage;
an xpath path judging module, configured to determine the xpath path corresponding to the group with the most characters and the fewest <a> tags;
and a crawler script generation module, configured to generate a crawler script according to the corresponding xpath path.
On the basis of this technical scheme, the crawler script generation module generates the corresponding Scrapy crawler script from the url address and the xpath path of the webpage, based on the Scrapy framework.
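The data flow between the seven modules can be sketched as a small class whose fields are the modules. This is only an illustration of how the modules might be wired together; the class name CrawlerScriptSystem, the field names and the choice to represent each <a> tag by its href string are all hypothetical, and every module is passed in as a plain function so that any implementation can be plugged in.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class CrawlerScriptSystem:
    traverse: Callable[[str], List[str]]                # tag traversal module
    xpath_of: Callable[[str], str]                      # xpath path generation module
    fetch: Callable[[str], str]                         # linked webpage acquisition module
    count: Callable[[str], Tuple[int, int]]             # information statistics module
    judge: Callable[[Dict[str, Tuple[int, int]]], str]  # xpath path judging module
    generate: Callable[[str, str], str]                 # crawler script generation module

    def group(self, anchors):
        """Tag grouping module: same xpath path, same group."""
        groups = defaultdict(list)
        for anchor in anchors:
            groups[self.xpath_of(anchor)].append(anchor)
        return groups

    def run(self, url):
        groups = self.group(self.traverse(url))
        # One linked page per group is opened and measured
        stats = {
            xpath: self.count(self.fetch(members[0]))
            for xpath, members in groups.items()
        }
        return self.generate(url, self.judge(stats))
```

Passing the modules in as callables keeps the flow testable with stubs before any network access or script generation is involved.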
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is intended to include such modifications and variations.

Claims (3)

1. A method for automatically acquiring a government website xpath path and generating a crawler script, comprising the following steps:
step 1, acquiring the url address of a webpage, opening the webpage through the url address, and traversing all <a> tags in the webpage;
the <a> tag is used to define a hyperlink; the webpage is a government website containing a subject information list; the specific algorithm for traversing all <a> tags in a webpage is as follows: calling a method of the jsoup package to acquire all <a> tags in the webpage together with their contents;
step 2, taking out the xpath path corresponding to each <a> tag;
the specific steps of step 2 are as follows:
step 2.1, acquiring the parent tag of the <a> tag through the jsoup package;
step 2.2, recursively acquiring the parent tag of each parent tag;
step 2.3, stopping when the acquired parent tag is <html>;
step 2.4, connecting all the acquired parent tags in sequence to obtain the xpath path of the <a> tag;
step 3, grouping the <a> tags according to the following principle: <a> tags with the same xpath path are placed in one group; then counting the number of <a> tags in each group;
based on the grouping of step 3, the <a> tags in any one group are the same;
the grouping result comprises: menu navigation, subject information lists, other navigation, advertisements, and other links;
step 4, taking one <a> tag from each group and opening the webpage it links to;
step 5, for each webpage opened in step 4, counting the number of <a> tags and the number of characters in the webpage;
the character count refers to the number of characters of the <a> tags;
the specific steps of step 5 are as follows: acquiring all <a> tags and their text() content through jsoup, and counting the number of <a> tags and the number of characters of the text() content;
step 6, selecting the group with the most characters and the fewest <a> tags, and recording its xpath path;
step 7, generating a crawler script from the recorded xpath path based on the Scrapy framework;
the specific steps of step 7 are as follows: sending the url address of the webpage in step 1 and the xpath path recorded in step 6 to the Scrapy framework to generate the corresponding Scrapy crawler script.
2. The method for automatically acquiring a government website xpath path and generating a crawler script according to claim 1, wherein: in step 1, the webpage is a webpage containing a subject information list.
3. A system for automatically acquiring a government website xpath path and generating a crawler script, comprising:
a tag traversal module, configured to open the webpage corresponding to the url address and traverse all <a> tags in the webpage; the <a> tag is used to define a hyperlink; the webpage is a government website containing a subject information list; the specific algorithm for traversing all <a> tags in a webpage is as follows: calling a method of the jsoup package to acquire all <a> tags in the webpage together with their contents;
an xpath path generation module, configured to take out the xpath path corresponding to each <a> tag; the xpath path generation module acquires the parent tag of the <a> tag through the jsoup package, then recursively acquires the parent tag of each parent tag until the acquired parent tag is <html>, and finally connects all the acquired parent tags in sequence to obtain the xpath path of the <a> tag;
a tag grouping module, configured to group the <a> tags and count the number of <a> tags in each group; based on the grouping of the tag grouping module, the <a> tags in any one group are the same; the grouping result comprises: menu navigation, subject information lists, other navigation, advertisements, and other links;
a linked webpage acquisition module, configured to open the linked webpage according to an <a> tag;
an information statistics module, configured to count the number of <a> tags and the number of characters in a webpage; the character count refers to the number of characters of the <a> tags; the information statistics module acquires all <a> tags and their text() content through jsoup, and counts the number of <a> tags and the number of characters of the text() content;
an xpath path judging module, configured to determine the xpath path corresponding to the group with the most characters and the fewest <a> tags;
a crawler script generation module, configured to generate a crawler script according to the corresponding xpath path; the crawler script generation module sends the url address of the webpage from the tag traversal module and the xpath path recorded by the xpath path judging module to the Scrapy framework to generate the corresponding Scrapy crawler script.
CN201711034452.1A 2017-10-30 2017-10-30 Method and system for automatically acquiring xpath generated crawler script Active CN107943838B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711034452.1A CN107943838B (en) 2017-10-30 2017-10-30 Method and system for automatically acquiring xpath generated crawler script

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711034452.1A CN107943838B (en) 2017-10-30 2017-10-30 Method and system for automatically acquiring xpath generated crawler script

Publications (2)

Publication Number Publication Date
CN107943838A CN107943838A (en) 2018-04-20
CN107943838B true CN107943838B (en) 2021-09-07

Family

ID=61936673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711034452.1A Active CN107943838B (en) 2017-10-30 2017-10-30 Method and system for automatically acquiring xpath generated crawler script

Country Status (1)

Country Link
CN (1) CN107943838B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109246069B (en) * 2018-06-15 2020-10-16 华为技术有限公司 Webpage login method and device and readable storage medium
CN109657117A (en) * 2018-11-12 2019-04-19 厦门市美亚柏科信息股份有限公司 A kind of extraction method, system and the computer storage medium of webpage element
CN110147476A (en) * 2019-04-12 2019-08-20 深圳壹账通智能科技有限公司 Data crawling method, terminal device and computer readable storage medium based on Scrapy
CN111444407B (en) * 2020-03-26 2023-05-16 桂林理工大学 Automatic extraction method and system for page list information of web crawlers
CN111460259B (en) * 2020-03-31 2023-04-14 腾讯科技(深圳)有限公司 Method and device for determining similar elements, computer equipment and storage medium
CN111831874B (en) * 2020-07-16 2022-08-19 深圳赛安特技术服务有限公司 Webpage data information acquisition method and device, computer equipment and storage medium
CN112099778B (en) * 2020-11-13 2021-02-02 北京智慧星光信息技术有限公司 Data acquisition method based on xpath, electronic equipment and storage medium
CN112417252B (en) * 2020-12-04 2023-05-09 天津开心生活科技有限公司 Crawler path determination method and device, storage medium and electronic equipment
CN114201971B (en) * 2021-12-13 2023-06-13 海南港航控股有限公司 Method and system for extracting character attribute from webpage

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073730A (en) * 2011-01-14 2011-05-25 哈尔滨工程大学 Method for constructing topic web crawler system
CN103020156A (en) * 2012-11-23 2013-04-03 北京小米科技有限责任公司 Processing method, device and equipment for webpage
CN103778238A (en) * 2014-01-27 2014-05-07 西安交通大学 Method for automatically building classification tree from semi-structured data of Wikipedia
CN104090931A (en) * 2014-06-25 2014-10-08 华南理工大学 Information prediction and acquisition method based on webpage link parameter analysis
CN104142985A (en) * 2014-07-23 2014-11-12 哈尔滨工业大学(威海) Semi-automatic vertical crawler generation tool and method
CN104598462A (en) * 2013-10-30 2015-05-06 深圳市国信互联科技有限公司 Method and device for extracting structural data
CN105677764A (en) * 2015-12-30 2016-06-15 百度在线网络技术(北京)有限公司 Information extraction method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104360882B (en) * 2014-11-07 2018-07-27 北京奇虎科技有限公司 Display methods and device are carried out to picture in webpage in a kind of browser
CN107066576B (en) * 2017-04-12 2019-11-12 成都四方伟业软件股份有限公司 A kind of big data web crawlers paging selection method and system
CN107016102B (en) * 2017-04-12 2019-12-03 成都四方伟业软件股份有限公司 A kind of big data web crawlers paging configuration method

Also Published As

Publication number Publication date
CN107943838A (en) 2018-04-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
    Inventor after: Ji Yongjie; Chen Guoqiang; Ren Jianxin
    Inventor before: Ji Yongjie; Chen Guoqiang; Wang Changyong; Ren Jianxin
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right
    Denomination of invention: A method and system for automatically obtaining xpath to generate crawler scripts
    Effective date of registration: 20231114
    Granted publication date: 20210907
    Pledgee: Beijing first financing Company limited by guarantee
    Pledgor: BEIJING DASY TECHNOLOGY DEVELOPMENT CO.,LTD.
    Registration number: Y2023110000472