CN106570133A - Method and device for constructing visual webpage information extracting rule - Google Patents

Method and device for constructing visual webpage information extracting rule

Info

Publication number
CN106570133A
Authority
CN
China
Prior art keywords
page
rule
web
extracting rule
info
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610956895.5A
Other languages
Chinese (zh)
Other versions
CN106570133B (en)
Inventor
李少敏
王毅敏
范娜
刘刚
唐新民
沈智杰
景晓军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SURFILTER NETWORK TECHNOLOGY Co Ltd
Original Assignee
SURFILTER NETWORK TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SURFILTER NETWORK TECHNOLOGY Co Ltd
Priority to CN201610956895.5A
Publication of CN106570133A
Application granted
Publication of CN106570133B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention discloses a method and a device for constructing a visual webpage information extraction rule. The method comprises the following steps: obtaining, according to a webpage element selected by a user, parameter information of the webpage element by using a webpage node analysis algorithm; filling in, according to the obtained parameter information, the configuration parameters required by the corresponding webpage information extraction actions; and performing, in a preset visual rule action management area, corresponding operations on the required webpage information extraction actions to generate the corresponding webpage information extraction rule. The method spares the user from analyzing the webpage structure and lowers the expertise required of the user; it also provides the user, in the preset visual rule action management area, with easy-to-operate management of webpage information extraction actions. This greatly reduces the difficulty of writing and maintaining webpage information extraction rules and improves the efficiency of constructing them.

Description

A method and device for constructing a visual webpage information extraction rule
Technical field
The present invention relates to the technical field of webpage information extraction, and in particular to a method and device for constructing a visual webpage information extraction rule.
Background
Webpage information extraction is a technique for extracting target information from webpages. When developing a data analysis product or service for a particular field, data must be extracted from the massive internet data of many websites. When extracting data from the pages of a single website, a programmer can build a rule so that target information is conveniently extracted in batches from multiple webpages that share the same structure.
However, the prior art has the following shortcomings in building target extraction rules. First, the author of an extraction rule must analyze the webpage structure manually to obtain a selector that uniquely identifies the target node and the container in which the target content resides. This places high demands on the rule author, who must have some knowledge of HTML (HyperText Markup Language) and of CSS selectors or xpath; the high level of expertise required in turn leads to high development costs. Second, during extraction, data extraction often fails because the rule was written incorrectly or the page has changed, so a person has to analyze the rule and the webpage structure again; such manual maintenance is inefficient.
Summary of the invention
To solve the problems that existing extraction rule construction methods demand a high level of expertise from rule authors and that writing and maintaining rules is inefficient, embodiments of the present invention provide a method and device for constructing a visual webpage information extraction rule. The technical solution is as follows:
In one aspect, an embodiment of the present invention provides a method for constructing a visual webpage information extraction rule, the method comprising:
obtaining, according to a webpage element selected by a user, parameter information of the webpage element by using a webpage node analysis algorithm, the parameter information comprising: the XML Path Language ("xpath" for short) path, the attributes, and the text value of the webpage element;
filling in, according to the obtained parameter information of the webpage element, the configuration parameters required by the corresponding webpage information extraction actions;
performing, in a preset visual rule action management area, corresponding operations on the required webpage information extraction actions to generate the corresponding webpage information extraction rule.
In the above method for constructing a visual webpage information extraction rule according to the embodiment of the present invention, the method further comprises:
running the generated webpage information extraction rule to generate a corresponding execution log and page snapshot, the execution log being used to record the execution result of each webpage information extraction action in the webpage information extraction rule;
obtaining a corresponding verification result according to a preset verification rule, the verification result being used by the user to maintain the page information extraction rule.
In the above method for constructing a visual webpage information extraction rule according to the embodiment of the present invention, obtaining the corresponding verification result according to the preset verification rule comprises:
parsing the generated execution log to obtain the webpage information extraction actions that failed to execute;
comparing the generated page snapshot with the original page by using a preset image comparison technique to obtain the positions at which the page snapshot differs from the original page.
In the above method for constructing a visual webpage information extraction rule according to the embodiment of the present invention, the method further comprises:
judging, according to the obtained verification result, whether the page extraction rule has passed the verification run;
when the page extraction rule passes the verification run, publishing the webpage extraction rule to a preset rule library for batch page information extraction.
In the above method for constructing a visual webpage information extraction rule according to the embodiment of the present invention, each webpage information extraction action is preset with an independent configuration page for filling in its configuration parameters.
In another aspect, an embodiment of the present invention provides a device for constructing a visual webpage information extraction rule, the device comprising:
a first acquisition module, configured to obtain, according to a webpage element selected by a user, parameter information of the webpage element by using a webpage node analysis algorithm, the parameter information comprising: the xpath, attributes, and text value of the webpage element;
a processing module, configured to fill in, according to the obtained parameter information of the webpage element, the configuration parameters required by the corresponding webpage information extraction actions;
a first generation module, configured to perform, in a preset visual rule action management area, corresponding operations on the required webpage information extraction actions to generate the corresponding webpage information extraction rule.
In the above device for constructing a visual webpage information extraction rule according to the embodiment of the present invention, the device further comprises:
a second generation module, configured to run the generated webpage information extraction rule to generate a corresponding execution log and page snapshot, the execution log being used to record the execution result of each webpage information extraction action in the webpage information extraction rule;
a second acquisition module, configured to obtain a corresponding verification result according to a preset verification rule, the verification result being used by the user to maintain the page information extraction rule.
In the above device for constructing a visual webpage information extraction rule according to the embodiment of the present invention, the second acquisition module comprises:
a first acquisition unit, configured to parse the generated execution log to obtain the webpage information extraction actions that failed to execute;
a second acquisition unit, configured to compare the generated page snapshot with the original page by using a preset image comparison technique to obtain the positions at which the page snapshot differs from the original page.
In the above device for constructing a visual webpage information extraction rule according to the embodiment of the present invention, the device further comprises:
a judgment module, configured to judge, according to the obtained verification result, whether the page extraction rule has passed the verification run;
a publishing module, configured to publish the webpage extraction rule to a preset rule library for batch page information extraction when the page extraction rule passes the verification run.
In the above device for constructing a visual webpage information extraction rule according to the embodiment of the present invention, each webpage information extraction action is preset with an independent configuration page for filling in its configuration parameters.
The technical solution provided by the embodiments of the present invention brings the following beneficial effects:
The parameter information of a webpage element selected by the user is obtained by using a webpage node analysis algorithm; the configuration parameters required by the corresponding webpage information extraction actions are then filled in according to the obtained parameter information; finally, in a preset visual rule action management area, corresponding operations are performed on the required webpage information extraction actions to generate the corresponding webpage information extraction rule. This spares the user from analyzing the webpage structure and lowers the expertise required of the user. It also provides the user, in the preset visual rule action management area, with easy-to-operate management of webpage information extraction actions, which greatly reduces the difficulty of writing and maintaining webpage information extraction rules and improves the efficiency of constructing them.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for constructing a visual webpage information extraction rule provided by embodiment one of the present invention;
Fig. 2 is an example of a visual operation interface provided by embodiment one of the present invention;
Fig. 3 is a flowchart of a method for constructing a visual webpage information extraction rule provided by embodiment one of the present invention;
Fig. 4 is a schematic structural diagram of a device for constructing a visual webpage information extraction rule provided by embodiment two of the present invention;
Fig. 5 is a schematic structural diagram of a device for constructing a visual webpage information extraction rule provided by embodiment two of the present invention;
Fig. 6 is a schematic structural diagram of a second acquisition module provided by embodiment two of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Embodiment one
An embodiment of the present invention provides a method for constructing a visual webpage information extraction rule. Referring to Fig. 1, the method may include:
Step S11: according to a webpage element selected by a user, obtain parameter information of the webpage element by using a webpage node analysis algorithm. The parameter information may include: the xpath, attributes, and text value of the webpage element.
In this embodiment, the webpage information that the user wants to extract is contained in the webpage element. xpath is the XML (Extensible Markup Language) Path Language, a language for locating a part of an XML document; it is based on the tree structure of XML and provides the ability to find nodes in the data structure tree. In practice, obtaining the xpath, attributes, and text value of a webpage element by using a webpage node analysis algorithm is prior art and is not described further here.
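As a rough illustration of step S11, the following minimal sketch derives a positional xpath, the attributes, and the text value of a selected element. It is not the patent's node analysis algorithm, which the text treats as prior art; it assumes a well-formed XHTML fragment and uses Python's standard `xml.etree.ElementTree`. The function name `element_params` and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET


def element_params(root, target):
    """Return a positional xpath, the attributes, and the text of `target`."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not None:
        parent = parent_of.get(node)
        if parent is None:
            steps.append(node.tag)  # the document root needs no index
        else:
            # xpath positions are 1-based among siblings with the same tag.
            same_tag = [c for c in parent if c.tag == node.tag]
            steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    return {
        "xpath": "/" + "/".join(reversed(steps)),
        "attributes": dict(target.attrib),
        "text": (target.text or "").strip(),
    }


# Hypothetical page; `selected` stands in for the element the user clicked.
page = ET.fromstring(
    "<html><body><div id='nav'/><div id='main'>"
    "<p class='title'>Hello</p><p class='price'>42</p>"
    "</div></body></html>"
)
selected = page.find(".//p[@class='price']")
params = element_params(page, selected)
```

A real implementation would work on the browser's live DOM rather than a parsed string, and might prefer shorter selectors (for example, ones keyed on `id`) over purely positional paths, which break more easily when the page layout changes.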
Step S12: according to the obtained parameter information of the webpage element, fill in the configuration parameters required by the corresponding webpage information extraction actions.
In this embodiment, a webpage information extraction rule is composed of multiple webpage information extraction actions, and each webpage information extraction action is an independent rule unit. Commonly used actions are described below, taking a browser with the Chromium kernel as an example:
Note: if checkers are present in params, the action waits for a certain element in the page until a certain condition is met; if the requirement is still not met within the timeout period, the rule script exits with a timeout.
From the webpage information extraction actions described above, it can be seen that the open action opens the target page, the click action triggers loading of the target element in the page, the wait action waits for the target element to finish loading, and the data action extracts the target information.
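The four actions and the timeout behaviour of checkers can be sketched as follows. Only the action names (open, click, wait, data) and the `params` and `checkers` keys come from the text; every other field name, the URL, and the `wait_until` helper are assumptions made for illustration.

```python
import time


def wait_until(checker, timeout_s, interval_s=0.01):
    """Poll `checker` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if checker():
            return True
        if time.monotonic() >= deadline:
            return False  # the caller would abort the rule script here
        time.sleep(interval_s)


# A rule is an ordered list of actions; the field names are illustrative.
rule = [
    {"action": "open", "params": {"url": "https://example.com/list"}},
    {"action": "click", "params": {"xpath": "/html/body[1]/div[2]/a[1]"}},
    {"action": "wait", "params": {"checkers": [{"xpath": "//table"}], "timeout_s": 5}},
    {"action": "data", "params": {"xpath": "//table//td", "field": "price"}},
]

# Simulated checker: the "element" appears on the third poll.
calls = {"n": 0}


def element_loaded():
    calls["n"] += 1
    return calls["n"] >= 3


loaded = wait_until(element_loaded, timeout_s=1.0)
timed_out = not wait_until(lambda: False, timeout_s=0.05)
```

In a real browser the checker would query the page for the element named in `checkers`; here it is a counter so that the polling loop can be exercised without a browser.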
In this embodiment, each webpage information extraction action must be given the corresponding parameter configuration according to the webpage element to be extracted, so that the action can operate on that webpage element.
In practice, Fig. 2 shows an example of a visual operation interface. When the user selects a webpage element in the webview area, the related parameter information of the webpage element is generated automatically in the rule editing area; the user only needs to select the parameters of interest to generate the corresponding webpage information extraction action.
Specifically, each webpage information extraction action is preset with an independent configuration page for filling in its configuration parameters.
In this embodiment, a complete webpage information extraction rule may be composed of multiple webpage information extraction actions. Each kind of action has a different focus and needs different webpage information, so its configuration page is also different. The configuration pages of different actions are independent of one another and not coupled, which facilitates subsequent maintenance and extension.
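One way to picture the parameter filling is a per-action list of required parameters that is populated from the element's parameter information. This is a sketch under assumptions: the `REQUIRED_PARAMS` table, the `fill_action` function, and the choice of which parameters each action needs are hypothetical, since the text does not enumerate them.

```python
# Which element parameters each action type needs (an assumed mapping).
REQUIRED_PARAMS = {
    "click": ["xpath"],
    "wait": ["xpath"],
    "data": ["xpath", "attributes", "text"],
}


def fill_action(action_type, element_params, extra=None):
    """Build one action, copying only the parameters its type requires."""
    needed = REQUIRED_PARAMS[action_type]
    params = {key: element_params[key] for key in needed}
    params.update(extra or {})  # values the user typed into the config page
    return {"action": action_type, "params": params}


# Parameter information as it might be produced for a selected element.
element = {
    "xpath": "/html/body[1]/div[2]/p[2]",
    "attributes": {"class": "price"},
    "text": "42",
}
data_action = fill_action("data", element, {"field": "price"})
click_action = fill_action("click", element)
```

Keeping the required parameters in a separate entry per action type mirrors the independent configuration pages: adding a new action type means adding one entry without touching the others.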
Step S13: in a preset visual rule action management area, perform corresponding operations on the required webpage information extraction actions to generate the corresponding webpage information extraction rule.
In this embodiment, the user can add, insert, delete, and edit webpage information extraction actions in the visual rule action management area shown in Fig. 2. Once multiple page information extraction actions have been edited into a certain execution order, the corresponding webpage information extraction rule has been generated.
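The add, insert, delete, and edit operations amount to editing an ordered list of actions. The sketch below models that list without any user interface; the class name `RuleEditor` and its method names are hypothetical.

```python
import copy


class RuleEditor:
    """Ordered list of actions, edited before the rule is generated."""

    def __init__(self):
        self.actions = []

    def add(self, action):
        self.actions.append(action)  # append at the end of the rule

    def insert(self, index, action):
        self.actions.insert(index, action)  # place before position `index`

    def delete(self, index):
        return self.actions.pop(index)

    def edit(self, index, **params):
        self.actions[index]["params"].update(params)

    def build(self):
        # The rule is the action list in its final execution order.
        return copy.deepcopy(self.actions)


editor = RuleEditor()
editor.add({"action": "open", "params": {"url": "https://example.com/list"}})
editor.add({"action": "data", "params": {"xpath": "//td", "field": "price"}})
editor.insert(1, {"action": "wait", "params": {"xpath": "//table"}})
editor.edit(1, timeout_s=10)
editor.add({"action": "click", "params": {"xpath": "//a"}})
editor.delete(3)
rule = editor.build()
```

Returning a deep copy from `build` keeps the generated rule independent of later edits in the editor.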
Optionally, referring to Fig. 3, the method may further include:
Step S14: run the generated webpage information extraction rule to generate a corresponding execution log and page snapshot. The execution log records the execution result of each webpage information extraction action in the webpage information extraction rule.
In this embodiment, a webpage information extraction rule generated by the method for constructing a visual webpage information extraction rule can be put through a verification run, and the execution log and page snapshot generated by the verification run are used to judge whether the webpage information extraction rule is acceptable.
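A verification run of this kind can be sketched as a loop that executes each action and records whether it succeeded. Everything here is an assumption for illustration: the handlers operate on a fake in-memory site instead of a browser, the "snapshot" is just the loaded page data rather than an image, and continuing after a failed action is a design choice the text does not specify.

```python
# Fake site: URL -> {xpath: text}; a stand-in for real pages.
FAKE_SITE = {"https://example.com/item": {"/html/body[1]/p[1]": "42"}}


def open_page(context, params):
    context["page"] = dict(FAKE_SITE[params["url"]])


def extract_data(context, params):
    page = context.get("page", {})
    if params["xpath"] not in page:
        raise LookupError(f"no element at {params['xpath']}")
    context.setdefault("data", {})[params["field"]] = page[params["xpath"]]


def run_rule(rule, handlers):
    """Execute every action in order and log one entry per action."""
    context = {}
    log = []
    for index, step in enumerate(rule):
        entry = {"index": index, "action": step["action"], "ok": True, "error": None}
        try:
            handlers[step["action"]](context, step["params"])
        except Exception as exc:  # record the failure and keep going
            entry["ok"] = False
            entry["error"] = str(exc)
        log.append(entry)
    return log, context.get("page")


rule = [
    {"action": "open", "params": {"url": "https://example.com/item"}},
    {"action": "data", "params": {"xpath": "/html/body[1]/p[1]", "field": "price"}},
    {"action": "data", "params": {"xpath": "/html/body[1]/h1[1]", "field": "title"}},
]
log, snapshot = run_rule(rule, {"open": open_page, "data": extract_data})
```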
Step S15: obtain a corresponding verification result according to a preset verification rule. The verification result is used by the user to maintain the page information extraction rule.
Specifically, step S15 may be implemented as follows:
parse the generated execution log to obtain the webpage information extraction actions that failed to execute;
compare the generated page snapshot with the original page by using a preset image comparison technique to obtain the positions at which the page snapshot differs from the original page.
In this embodiment, if the execution log records a webpage information extraction action that failed to execute, or if the generated page snapshot differs from the original page, the user can use this information to maintain the webpage information extraction actions in the page information extraction rule that need to be changed, and then put the maintained page information extraction rule through another verification run, until it passes.
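The two checks in step S15 can be sketched as follows. The log format (one JSON object per line with an `ok` flag) is an assumption, and the pixel-by-pixel comparison on small grids is only a stand-in for whatever image comparison technique is actually preset; the text does not name one.

```python
import json


def failed_actions(log_text):
    """Parse a JSON-lines execution log and return the failed entries."""
    entries = [json.loads(line) for line in log_text.splitlines() if line.strip()]
    return [entry for entry in entries if not entry["ok"]]


def diff_positions(snapshot, original):
    """Return (row, column) positions where two equally sized images differ."""
    if len(snapshot) != len(original) or any(
        len(a) != len(b) for a, b in zip(snapshot, original)
    ):
        raise ValueError("images must have the same dimensions")
    return [
        (r, c)
        for r, (row_a, row_b) in enumerate(zip(snapshot, original))
        for c, (a, b) in enumerate(zip(row_a, row_b))
        if a != b
    ]


log_text = (
    '{"index": 0, "action": "open", "ok": true}\n'
    '{"index": 1, "action": "data", "ok": false}\n'
)
failed = failed_actions(log_text)

# Tiny grayscale "images" as nested lists of pixel values.
snapshot_img = [[0, 0, 0], [0, 1, 0]]
original_img = [[0, 0, 0], [0, 0, 1]]
diffs = diff_positions(snapshot_img, original_img)
```

The failed entries point the user at the actions to fix, and the differing positions point at the regions of the page that did not render or load as expected.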
Optionally, referring to Fig. 3, the method may further include:
Step S16: judge, according to the obtained verification result, whether the page extraction rule has passed the verification run. If the page extraction rule passes the verification run, step S17 is performed. If it does not pass, the user needs to maintain the page information extraction rule according to the verification result and put the maintained rule through another verification run.
Step S17: publish the webpage extraction rule to a preset rule library for batch page information extraction.
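Steps S16 and S17 reduce to a gate in front of the rule library. In this sketch the pass criterion (no failed actions and no snapshot differences) is an assumption drawn from the maintenance conditions described above, and the rule library is modelled as a plain dictionary; `passes_verification` and `publish_if_valid` are hypothetical names.

```python
def passes_verification(failed, diffs):
    """A rule passes when no action failed and the snapshot matches."""
    return not failed and not diffs


def publish_if_valid(rule_id, rule, failed, diffs, rule_library):
    """Publish the rule to the library only if the verification run passed."""
    if passes_verification(failed, diffs):
        rule_library[rule_id] = rule
        return True
    return False  # the user must maintain the rule and verify again


rule_library = {}
rule = [{"action": "open", "params": {"url": "https://example.com/list"}}]

published = publish_if_valid("shop-list", rule, [], [], rule_library)
rejected = publish_if_valid(
    "shop-detail", rule, [{"index": 1, "ok": False}], [], rule_library
)
```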
In this embodiment of the present invention, the parameter information of a webpage element selected by the user is obtained by using a webpage node analysis algorithm; the configuration parameters required by the corresponding webpage information extraction actions are then filled in according to the obtained parameter information; finally, in a preset visual rule action management area, corresponding operations are performed on the required webpage information extraction actions to generate the corresponding webpage information extraction rule. This spares the user from analyzing the webpage structure and lowers the expertise required of the user. It also provides the user, in the preset visual rule action management area, with easy-to-operate management of webpage information extraction actions, which greatly reduces the difficulty of writing and maintaining webpage information extraction rules and improves the efficiency of constructing them.
Embodiment two
An embodiment of the present invention provides a device for constructing a visual webpage information extraction rule, which applies the method for constructing a visual webpage information extraction rule described in embodiment one. Referring to Fig. 4, the device may include: a first acquisition module 100, a processing module 200, and a first generation module 300.
The first acquisition module 100 is configured to obtain, according to a webpage element selected by a user, parameter information of the webpage element by using a webpage node analysis algorithm. The parameter information includes: the xpath, attributes, and text value of the webpage element.
In this embodiment, the webpage information that the user wants to extract is contained in the webpage element. xpath is the XML Path Language, a language for locating a part of an XML document; it is based on the tree structure of XML and provides the ability to find nodes in the data structure tree. In practice, obtaining the xpath, attributes, and text value of a webpage element by using a webpage node analysis algorithm is prior art and is not described further here.
The processing module 200 is configured to fill in, according to the obtained parameter information of the webpage element, the configuration parameters required by the corresponding webpage information extraction actions.
In this embodiment, a webpage information extraction rule is composed of multiple webpage information extraction actions, and each webpage information extraction action is an independent rule unit. The open action opens the target page, the click action triggers loading of the target element in the page, the wait action waits for the target element to finish loading, and the data action extracts the target information.
In this embodiment, each webpage information extraction action must be given the corresponding parameter configuration according to the webpage element to be extracted, so that the action can operate on that webpage element.
Specifically, each webpage information extraction action is preset with an independent configuration page for filling in its configuration parameters.
In this embodiment, a complete webpage information extraction rule may be composed of multiple webpage information extraction actions. Each kind of action has a different focus and needs different webpage information, so its configuration page is also different. The configuration pages of different actions are independent of one another and not coupled, which facilitates subsequent maintenance and extension.
The first generation module 300 is configured to perform, in a preset visual rule action management area, corresponding operations on the required webpage information extraction actions to generate the corresponding webpage information extraction rule.
In this embodiment, the user can add, insert, delete, and edit webpage information extraction actions in the visual rule action management area. Once multiple page information extraction actions have been edited into a certain execution order, the corresponding webpage information extraction rule has been generated.
Optionally, referring to Fig. 5, the device may further include: a second generation module 400 and a second acquisition module 500.
The second generation module 400 is configured to run the generated webpage information extraction rule to generate a corresponding execution log and page snapshot. The execution log records the execution result of each webpage information extraction action in the webpage information extraction rule.
In this embodiment, a webpage information extraction rule generated by the device for constructing a visual webpage information extraction rule can be put through a verification run, and the execution log and page snapshot generated by the verification run are used to judge whether the webpage information extraction rule is acceptable.
The second acquisition module 500 is configured to obtain a corresponding verification result according to a preset verification rule. The verification result is used by the user to maintain the page information extraction rule.
Further, referring to Fig. 6, the second acquisition module 500 may include: a first acquisition unit 501 and a second acquisition unit 502.
The first acquisition unit 501 is configured to parse the generated execution log to obtain the webpage information extraction actions that failed to execute.
The second acquisition unit 502 is configured to compare the generated page snapshot with the original page by using a preset image comparison technique to obtain the positions at which the page snapshot differs from the original page.
In this embodiment, if the execution log records a webpage information extraction action that failed to execute, or if the generated page snapshot differs from the original page, the user can use this information to maintain the webpage information extraction actions in the page information extraction rule that need to be changed, and then put the maintained page information extraction rule through another verification run, until it passes.
Optionally, referring to Fig. 5, the device may further include: a judgment module 600 and a publishing module 700.
The judgment module 600 is configured to judge, according to the obtained verification result, whether the page extraction rule has passed the verification run.
The publishing module 700 is configured to publish the webpage extraction rule to a preset rule library for batch page information extraction when the page extraction rule passes the verification run.
In this embodiment of the present invention, the parameter information of a webpage element selected by the user is obtained by using a webpage node analysis algorithm; the configuration parameters required by the corresponding webpage information extraction actions are then filled in according to the obtained parameter information; finally, in a preset visual rule action management area, corresponding operations are performed on the required webpage information extraction actions to generate the corresponding webpage information extraction rule. This spares the user from analyzing the webpage structure and lowers the expertise required of the user. It also provides the user, in the preset visual rule action management area, with easy-to-operate management of webpage information extraction actions, which greatly reduces the difficulty of writing and maintaining webpage information extraction rules and improves the efficiency of constructing them.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
It should be noted that when the device for constructing a visual webpage information extraction rule provided by the above embodiment implements the method for constructing a visual webpage information extraction rule, the division into the functional modules described above is used only as an example. In practical applications, the above functions may be allocated to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the device for constructing a visual webpage information extraction rule provided by the above embodiment and the method embodiment for constructing a visual webpage information extraction rule belong to the same concept; for its specific implementation process, refer to the method embodiment, which is not repeated here.
A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A method for constructing a visual webpage information extraction rule, characterized in that the method comprises:
obtaining, according to a webpage element selected by a user, parameter information of the webpage element by using a webpage node analysis algorithm, the parameter information comprising: the xpath, attributes, and text value of the webpage element;
filling in, according to the obtained parameter information of the webpage element, the configuration parameters required by the corresponding webpage information extraction actions;
performing, in a preset visual rule action management area, corresponding operations on the required webpage information extraction actions to generate the corresponding webpage information extraction rule.
2. The method according to claim 1, characterized in that the method further comprises:
running the generated webpage information extraction rule to generate a corresponding execution log and page snapshot, the execution log being used to record the execution result of each webpage information extraction action in the webpage information extraction rule;
obtaining a corresponding verification result according to a preset verification rule, the verification result being used by the user to maintain the page information extraction rule.
3. The method according to claim 2, characterized in that obtaining the corresponding verification result according to the preset verification rule comprises:
parsing the generated execution log to obtain the webpage information extraction actions that failed to execute;
comparing the generated page snapshot with the original page by using a preset image comparison technique to obtain the positions at which the page snapshot differs from the original page.
4. The method according to claim 2, characterized in that the method further comprises:
judging, according to the obtained verification result, whether the page extraction rule has passed the verification run;
when the page extraction rule passes the verification run, publishing the webpage extraction rule to a preset rule library for batch page information extraction.
5. The method according to any one of claims 1-4, characterized in that each webpage information extraction action is preset with an independent configuration page for filling in its configuration parameters.
6. a kind of construction device of visual info web extracting rule, it is characterised in that described device includes:
First acquisition module, for the web page element selected according to user, using web page joint parser web page element is obtained Parameter information, the parameter information includes:The xpath of web page element, attribute and textual value;
Processing module, for according to the web page element parameter information for getting, to needed for corresponding info web extraction action Configuration parameter is filled;
First generation module, in default visual rule action directorial area, to required info web extraction action Corresponding operating is carried out, correspondingly info web extracting rule is generated.
7. device according to claim 6, it is characterised in that also include:
Second generation module, for the info web extracting rule that operation is generated, generates corresponding execution journal and page snapshot, The execution journal is used to record the implementing result of each info web extraction action in info web extracting rule;
Second acquisition module, for according to default proof rule, obtaining corresponding the result, the result is used to supply User safeguards page info extracting rule.
8. The device according to claim 7, characterized in that the second acquisition module comprises:
A first acquisition unit, configured to parse the generated execution log to obtain the webpage information extraction actions that failed to execute;
A second acquisition unit, configured to compare the generated page snapshot with the original page using a preset image comparison technique, to obtain the positions at which the page snapshot differs from the original page.
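The log-parsing role of the first acquisition unit can be sketched as follows. The patent does not specify a log format, so this example assumes one JSON object per line with `action` and `status` fields; both the format and the function name `failed_actions` are hypothetical.

```python
import json
from typing import List


def failed_actions(log_text: str) -> List[str]:
    """Return the names of actions whose logged status is not "ok"."""
    failed = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if record.get("status") != "ok":
            failed.append(record["action"])
    return failed
```

The resulting list, together with the differing positions found by the second acquisition unit, would make up the verification result that the user consults when maintaining the rule.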
9. The device according to claim 7, characterized in that it further comprises:
A judging module, configured to judge, according to the obtained verification result, whether the page extraction rule passes the verification operation;
A publishing module, configured to publish the webpage extraction rule to a preset rule base for batch page information extraction when the page extraction rule passes the verification operation.
10. The device according to any one of claims 6-9, characterized in that each webpage information extraction action is preset with an independent configuration page for filling in its configuration parameters.
CN201610956895.5A 2016-10-27 2016-10-27 A kind of construction method and device of visual webpage information extracting rule Active CN106570133B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610956895.5A CN106570133B (en) 2016-10-27 2016-10-27 A kind of construction method and device of visual webpage information extracting rule

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610956895.5A CN106570133B (en) 2016-10-27 2016-10-27 A kind of construction method and device of visual webpage information extracting rule

Publications (2)

Publication Number Publication Date
CN106570133A (en) 2017-04-19
CN106570133B (en) 2019-07-23

Family

ID=58535373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610956895.5A Active CN106570133B (en) 2016-10-27 2016-10-27 A kind of construction method and device of visual webpage information extracting rule

Country Status (1)

Country Link
CN (1) CN106570133B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003308275A (en) * 2002-04-12 2003-10-31 Sharp Corp System and method for extracting webpage information
CN103714116A (en) * 2013-10-31 2014-04-09 北京奇虎科技有限公司 Webpage information extracting method and webpage information extracting equipment
CN104050281A (en) * 2014-06-26 2014-09-17 北京思特奇信息技术股份有限公司 Webpage information extraction method and device based on http protocol
CN105468730A (en) * 2015-11-20 2016-04-06 广州华多网络科技有限公司 Webpage information extraction method and equipment

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729475A (en) * 2017-10-16 2018-02-23 深圳视界信息技术有限公司 Web page element acquisition method, device, terminal and computer-readable recording medium
CN108874977A (en) * 2018-06-08 2018-11-23 东软集团股份有限公司 Page data extracting method, device, storage medium and electronic equipment
CN108874977B (en) * 2018-06-08 2020-11-27 东软集团股份有限公司 Page data extraction method and device, storage medium and electronic equipment
CN109657117A (en) * 2018-11-12 2019-04-19 厦门市美亚柏科信息股份有限公司 A kind of extraction method, system and the computer storage medium of webpage element
CN112597377A (en) * 2020-12-25 2021-04-02 北京百度网讯科技有限公司 Information extraction module generation method, information extraction method and device

Also Published As

Publication number Publication date
CN106570133B (en) 2019-07-23

Similar Documents

Publication Publication Date Title
JP6842167B2 (en) Summary generator, summary generation method and computer program
He et al. Duplicate bug report detection using dual-channel convolutional neural networks
CN104267947B (en) A kind of editor's method of pop-up picture and pop-up picture editor's device
US7386558B2 (en) Methods and systems for filtering an Extensible Application Markup Language (XAML) file to facilitate indexing of the logical content contained therein
US20130305149A1 (en) Document reader and system for extraction of structural and semantic information from documents
CN101464905A (en) Web page information extraction system and method
Huynh et al. Enabling web browsers to augment web sites' filtering and sorting functionalities
Graliński et al. Kleister: A novel task for information extraction involving long documents with complex layout
CN106570133A (en) Method and device for constructing visual webpage information extracting rule
CA2698914A1 (en) Document segmentation
JPH11143912A (en) Related document display device
CN106960058A (en) A kind of structure of web page alteration detection method and system
CN116860987A (en) Domain knowledge graph construction method and system based on generation type large language model
Aumiller et al. Online dateing: a web interface for temporal annotations
EP2599042A1 (en) Systems and methods of rapid business discovery and transformation of business processes
Sanoja et al. Block-o-matic: a web page segmentation tool and its evaluation
KR101104753B1 (en) Extraction method for hierarchical structure in text contents of structural calculation document
Doulani et al. Analysis of Iranian and British university websites by world wide web consortium.
CN107203525A (en) The treating method and apparatus of database
CN110837614A (en) Method and system for efficiently generating webpage information extraction rule
CN112328246A (en) Page component generation method and device, computer equipment and storage medium
Maria et al. MATESC: Metadata-Analytic Text Extractor and Section Classifier for Scientific Publications.
Sithole et al. Attributes extraction for fine-grained differentiation of the Internet of Things patterns
CN113268412B (en) Control analysis method, device, equipment and medium for Web system test case recording
JP2004303097A (en) Partial document extraction program and partial document extraction method of structured document

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant