CN106570133A - Method and device for constructing visual webpage information extracting rule - Google Patents
- Publication number
- CN106570133A (application CN201610956895.5A)
- Authority
- CN
- China
- Prior art keywords
- page
- rule
- web
- extracting rule
- info
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Abstract
The invention discloses a method and a device for constructing a visual web page information extraction rule. The method comprises: obtaining, according to a web page element selected by a user, parameter information of the web page element by means of a web page node parsing algorithm; filling in, according to the obtained parameter information, the configuration parameters required by the corresponding web page information extraction actions; and, in a preset visual rule action management area, performing corresponding operations on the required web page information extraction actions to generate the corresponding web page information extraction rule. The method spares the user from analyzing the web page structure and lowers the expertise required of the user. It also provides the user, in the preset visual rule action management area, with easy-to-operate management of web page information extraction actions. This greatly reduces the difficulty of writing and maintaining web page information extraction rules and improves the efficiency of constructing them.
Description
Technical field
The present invention relates to the technical field of web page information extraction, and in particular to a method and a device for constructing a visual web page information extraction rule.
Background art
Web page information extraction is a technology for extracting target information from web pages. When developing a data analysis product or service for a particular field, data must be extracted from the massive Internet data of many websites. When extracting data from the pages of a single website, a programmer can build a rule so that target information can be conveniently extracted in batches from multiple web pages that share the same structure.
However, the prior art has the following shortcomings when building target extraction rules. First, writing an extraction rule requires a person to analyze the web page structure and obtain from it a selector that uniquely identifies the target node and the container in which the target content resides. This places high demands on the rule writer, who must have some understanding of HTML (HyperText Markup Language) and of CSS selectors or XPath. The high level of expertise required in turn leads to high development costs. Second, during extraction, data extraction often fails because a rule was written incorrectly or the page has changed, and a person must analyze the rule and the web page structure again. Such manual maintenance is inefficient.
Summary of the invention
To solve the problems that existing extraction rule construction methods demand a high level of expertise from the rule writer and that writing and maintenance are inefficient, embodiments of the present invention provide a method and a device for constructing a visual web page information extraction rule. The technical solution is as follows:
In one aspect, an embodiment of the present invention provides a method for constructing a visual web page information extraction rule, the method comprising:
obtaining, according to a web page element selected by a user, parameter information of the web page element by means of a web page node parsing algorithm, the parameter information comprising the XML Path Language ("xpath" for short) expression, the attributes and the text value of the web page element;
filling in, according to the obtained web page element parameter information, the configuration parameters required by the corresponding web page information extraction actions;
performing, in a preset visual rule action management area, corresponding operations on the required web page information extraction actions to generate the corresponding web page information extraction rule.
The above method for constructing a visual web page information extraction rule according to the embodiment of the present invention further comprises:
running the generated web page information extraction rule to generate a corresponding execution log and page snapshot, the execution log being used to record the execution result of each web page information extraction action in the rule;
obtaining a corresponding verification result according to a preset verification rule, the verification result being used by the user to maintain the web page information extraction rule.
In the above method, obtaining the corresponding verification result according to the preset verification rule comprises:
parsing the generated execution log to obtain the web page information extraction actions that failed to execute;
comparing the generated page snapshot with the original page by means of a preset image comparison technique to obtain the positions at which the page snapshot differs from the original page.
The above method further comprises:
judging, according to the obtained verification result, whether the web page information extraction rule passes the verification run;
when the rule passes the verification run, publishing the rule to a preset rule library for batch extraction of page information.
In the above method, each web page information extraction action is provided in advance with an independent configuration page for filling in its configuration parameters.
In another aspect, an embodiment of the present invention provides a device for constructing a visual web page information extraction rule, the device comprising:
a first acquisition module, configured to obtain, according to a web page element selected by a user, parameter information of the web page element by means of a web page node parsing algorithm, the parameter information comprising the xpath, the attributes and the text value of the web page element;
a processing module, configured to fill in, according to the obtained web page element parameter information, the configuration parameters required by the corresponding web page information extraction actions;
a first generation module, configured to perform, in a preset visual rule action management area, corresponding operations on the required web page information extraction actions to generate the corresponding web page information extraction rule.
The above device for constructing a visual web page information extraction rule according to the embodiment of the present invention further comprises:
a second generation module, configured to run the generated web page information extraction rule and generate a corresponding execution log and page snapshot, the execution log being used to record the execution result of each web page information extraction action in the rule;
a second acquisition module, configured to obtain a corresponding verification result according to a preset verification rule, the verification result being used by the user to maintain the web page information extraction rule.
In the above device, the second acquisition module comprises:
a first acquisition unit, configured to parse the generated execution log and obtain the web page information extraction actions that failed to execute;
a second acquisition unit, configured to compare the generated page snapshot with the original page by means of a preset image comparison technique and obtain the positions at which the page snapshot differs from the original page.
The above device further comprises:
a judgment module, configured to judge, according to the obtained verification result, whether the web page information extraction rule passes the verification run;
a publishing module, configured to publish the rule to a preset rule library for batch extraction of page information when the rule passes the verification run.
In the above device, each web page information extraction action is provided in advance with an independent configuration page for filling in its configuration parameters.
The technical solution provided by the embodiments of the present invention brings the following beneficial effects:
Parameter information of a web page element is obtained by a web page node parsing algorithm according to the element selected by the user; the configuration parameters required by the corresponding web page information extraction actions are then filled in according to the obtained parameter information; finally, corresponding operations are performed on the required actions in a preset visual rule action management area to generate the corresponding web page information extraction rule. This spares the user from analyzing the web page structure and lowers the expertise required of the user. It also provides the user, in the preset visual rule action management area, with easy-to-operate management of web page information extraction actions, which greatly reduces the difficulty of writing and maintaining extraction rules and improves the efficiency of constructing them.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for constructing a visual web page information extraction rule provided by Embodiment 1 of the present invention;
Fig. 2 is an example of a visual operation interface provided by Embodiment 1 of the present invention;
Fig. 3 is a flowchart of a method for constructing a visual web page information extraction rule provided by Embodiment 1 of the present invention;
Fig. 4 is a schematic structural diagram of a device for constructing a visual web page information extraction rule provided by Embodiment 2 of the present invention;
Fig. 5 is a schematic structural diagram of a device for constructing a visual web page information extraction rule provided by Embodiment 2 of the present invention;
Fig. 6 is a schematic structural diagram of a second acquisition module provided by Embodiment 2 of the present invention.
Detailed description
To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.
Embodiment one
An embodiment of the present invention provides a method for constructing a visual web page information extraction rule. Referring to Fig. 1, the method may include:
Step S11: according to the web page element selected by the user, obtain parameter information of the web page element by means of a web page node parsing algorithm. The parameter information may include the xpath, the attributes and the text value of the web page element.
In this embodiment, the web page information that the user wants to extract is contained in the web page element. Xpath is the Extensible Markup Language (XML) path language, a language for locating a particular part of an XML document. Xpath is based on the tree structure of XML and provides the ability to find nodes in the data structure tree. In practical applications, obtaining the xpath, attributes and text value of a web page element by means of a web page node parsing algorithm is prior art and is not described further here.
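As an illustration of the kind of parameter information step S11 refers to, the following minimal sketch derives an absolute xpath, the attributes and the text value of a selected node. It is not the patent's parsing algorithm, which the text leaves unspecified; it assumes a small well-formed page parsed with Python's standard `xml.etree.ElementTree`, and the function name `element_params` and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET

def element_params(root, target):
    """Return the absolute XPath, attributes and text of `target` inside `root`."""
    # ElementTree keeps no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not None:
        parent = parent_of.get(node)
        if parent is None:
            steps.append(node.tag)  # the document root needs no position
        else:
            # XPath positions are 1-based and count only same-tag siblings.
            same_tag = [c for c in parent if c.tag == node.tag]
            steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    return {
        "xpath": "/" + "/".join(reversed(steps)),
        "attributes": dict(target.attrib),
        "text": (target.text or "").strip(),
    }

# Hypothetical page; a real page would come from the browser's DOM.
PAGE = """<html><body>
  <div class="list"><p>first</p><p id="price">42 USD</p></div>
</body></html>"""

root = ET.fromstring(PAGE)
selected = root.find(".//p[@id='price']")  # stands in for the user's click
params = element_params(root, selected)
```

A positional xpath such as this one is simple to compute but brittle when the page layout changes, which is one reason the later verification steps matter.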
Step S12: according to the obtained web page element parameter information, fill in the configuration parameters required by the corresponding web page information extraction actions.
In this embodiment, a web page information extraction rule is composed of multiple web page information extraction actions, each of which is an independent rule unit. Commonly used actions are described below, taking a browser with the Chromium kernel as an example:
Note: if params contains checkers, the action waits for a certain element in the page until a certain condition is met. If the condition is still not met within the timeout period, the rule script times out and exits.
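The wait-until-condition-or-timeout behaviour described in the note can be sketched as a simple polling loop. This is an assumption about how such a check might be implemented, not the patent's code; `wait_until`, `timeout_s` and `poll_s` are hypothetical names, and the checker here is a plain Python callable rather than a real page query.

```python
import time

def wait_until(checker, timeout_s=5.0, poll_s=0.1):
    """Poll `checker` until it returns True or `timeout_s` seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if checker():
            return True          # condition met: the rule can continue
        if time.monotonic() >= deadline:
            return False         # timed out: the rule script should exit
        time.sleep(poll_s)

# Hypothetical checker that succeeds on its third call,
# standing in for "the target element has finished loading".
calls = {"n": 0}

def element_loaded():
    calls["n"] += 1
    return calls["n"] >= 3

ok = wait_until(element_loaded, timeout_s=1.0, poll_s=0.01)
timed_out = wait_until(lambda: False, timeout_s=0.05, poll_s=0.01)
```

Using a monotonic clock keeps the timeout unaffected by system clock adjustments.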
From the web page information extraction actions described above, it can be seen that the open action opens the target page, the click action triggers loading of the target element in the page, the wait action waits for the target element to finish loading, and the data action extracts the target information.
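A rule built from these four actions can be pictured as an ordered list of independent units. The sketch below is only one possible representation: the `Action` class and every parameter key (`url`, `xpath`, `timeout_s`, `field`) are hypothetical, since the source does not give the actual parameter schema.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str                                   # "open", "click", "wait" or "data"
    params: dict = field(default_factory=dict)  # hypothetical per-action settings

# An extraction rule is the actions in their execution order.
rule = [
    Action("open", {"url": "https://example.com/list"}),
    Action("click", {"xpath": "/html/body[1]/div[1]/a[1]"}),
    Action("wait", {"xpath": "/html/body[1]/div[2]", "timeout_s": 10}),
    Action("data", {"xpath": "/html/body[1]/div[2]/p[1]", "field": "price"}),
]

kinds = [action.kind for action in rule]
```

Keeping each action self-contained mirrors the text's point that actions are independent rule units.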
In this embodiment, each web page information extraction action must be configured with parameters according to the web page element to be extracted, so that the action can operate on that element.
In practical applications, Fig. 2 shows an example of a visual operation interface. When the user selects a web page element in the webview area, the related parameter information of the element is generated automatically in the rule editing area. The user only needs to select the parameters of interest to generate the corresponding web page information extraction action.
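The pre-filling described here might look like the following sketch, in which the user's choice of parameters is copied from the selected element into an action's configuration. The function `fill_action_config` and the dictionary layout are assumptions made for illustration.

```python
def fill_action_config(kind, element_params, chosen=("xpath",)):
    """Pre-fill an action's configuration from the selected element's parameters."""
    config = {"kind": kind}
    for key in chosen:
        if key not in element_params:
            raise KeyError(f"element has no parameter {key!r}")
        config[key] = element_params[key]
    return config

# Hypothetical parameter information produced when the user selects an element.
selected = {
    "xpath": "/html/body[1]/div[1]/p[2]",
    "attributes": {"id": "price"},
    "text": "42 USD",
}

# The user keeps only the parameters of interest for a data action.
data_action = fill_action_config("data", selected, chosen=("xpath", "text"))
```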
Specifically, each web page information extraction action is provided in advance with an independent configuration page for filling in its configuration parameters.
In this embodiment, a complete web page information extraction rule may consist of multiple web page information extraction actions. Each kind of action has a different focus and needs different web page information, so its configuration page is also different. The configuration pages of different actions are independent of one another and not coupled, which facilitates later maintenance and extension.
Step S13: in the preset visual rule action management area, perform corresponding operations on the required web page information extraction actions to generate the corresponding web page information extraction rule.
In this embodiment, the user can add, insert, delete and edit web page information extraction actions in the visual rule action management area shown in Fig. 2. Once multiple actions have been edited into a certain execution order, the corresponding web page information extraction rule has been generated.
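The add, insert, delete and edit operations of the management area map naturally onto an ordered list. The following sketch assumes actions are plain dictionaries; the `RuleEditor` class and its method names are hypothetical and stand in for whatever the visual interface does behind the scenes.

```python
class RuleEditor:
    """Holds the ordered actions of one extraction rule while it is being edited."""

    def __init__(self):
        self.actions = []

    def add(self, action):
        self.actions.append(action)          # append at the end

    def insert(self, index, action):
        self.actions.insert(index, action)   # place before position `index`

    def delete(self, index):
        return self.actions.pop(index)

    def edit(self, index, **changes):
        self.actions[index] = {**self.actions[index], **changes}

    def generate_rule(self):
        # The rule is the actions in their current execution order (copied).
        return [dict(action) for action in self.actions]

editor = RuleEditor()
editor.add({"kind": "open", "url": "https://example.com"})
editor.add({"kind": "data", "xpath": "//p"})
editor.insert(1, {"kind": "wait", "xpath": "//p"})
editor.edit(2, xpath="//p[2]")
editor.add({"kind": "click", "xpath": "//a"})
editor.delete(3)
rule = editor.generate_rule()
```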
Optionally, referring to Fig. 3, the method may further include:
Step S14: run the generated web page information extraction rule to generate a corresponding execution log and page snapshot. The execution log records the execution result of each web page information extraction action in the rule.
In this embodiment, a web page information extraction rule generated by the visual construction method can be given a verification run, and the execution log and page snapshot produced by the verification run are used to judge whether the rule is qualified.
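A verification run that records one log entry per action could be sketched as below. The log format, the `run_rule` function and the stub handlers are assumptions; a real run would drive a browser and also capture a page snapshot, which this sketch omits.

```python
def run_rule(rule, handlers):
    """Execute each action in order and record its result in an execution log."""
    log = []
    for index, action in enumerate(rule):
        entry = {"index": index, "kind": action["kind"], "ok": True, "error": None}
        try:
            handlers[action["kind"]](action)
        except Exception as exc:  # record the failure instead of aborting the run
            entry["ok"] = False
            entry["error"] = str(exc)
        log.append(entry)
    return log

# Stub handlers standing in for real browser operations.
def open_page(action):
    pass

def extract_data(action):
    raise LookupError("element not found: " + action["xpath"])

handlers = {"open": open_page, "data": extract_data}
log = run_rule(
    [{"kind": "open", "url": "https://example.com"}, {"kind": "data", "xpath": "//p"}],
    handlers,
)
```

Continuing after a failed action is a design choice of this sketch; it lets a single run report every failing action rather than only the first.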
Step S15: obtain a corresponding verification result according to a preset verification rule. The verification result is used by the user to maintain the web page information extraction rule.
Specifically, step S15 can be implemented as follows:
parse the generated execution log to obtain the web page information extraction actions that failed to execute;
compare the generated page snapshot with the original page by means of a preset image comparison technique to obtain the positions at which the page snapshot differs from the original page.
In this embodiment, if the execution log records an action that failed to execute, or the generated page snapshot differs from the original page, the user can use this information to maintain the actions in the rule that need to be changed. The maintained rule is then given another verification run, and this is repeated until it passes.
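The two checks of step S15 can be sketched as follows. The source does not say which image comparison technique is used, so the pixel-by-pixel comparison here is an assumption, as are the function names and the representation of images as nested lists of pixel values.

```python
def failed_actions(log):
    """Return the log entries of actions that failed to execute."""
    return [entry for entry in log if not entry["ok"]]

def diff_positions(snapshot, original):
    """Return (row, col) positions where two equally sized pixel grids differ."""
    same_shape = len(snapshot) == len(original) and all(
        len(a) == len(b) for a, b in zip(snapshot, original)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")
    return [
        (r, c)
        for r, (row_a, row_b) in enumerate(zip(snapshot, original))
        for c, (a, b) in enumerate(zip(row_a, row_b))
        if a != b
    ]

# Hypothetical execution log and tiny grayscale "images".
log = [
    {"index": 0, "kind": "open", "ok": True},
    {"index": 1, "kind": "data", "ok": False},
]
original = [[0, 0, 0], [0, 1, 0]]
snapshot = [[0, 0, 0], [0, 1, 1]]

failed = failed_actions(log)
diffs = diff_positions(snapshot, original)
```

In practice a tolerance or region-based comparison would likely be needed, because rendering noise makes exact pixel equality too strict.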
Optionally, referring to Fig. 3, the method may further include:
Step S16: judge, according to the obtained verification result, whether the web page information extraction rule passes the verification run. If the rule passes, step S17 is performed. If it does not pass, the user maintains the rule according to the verification result, and the maintained rule is given another verification run.
Step S17: publish the web page information extraction rule to the preset rule library for batch extraction of page information.
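Steps S16 and S17 amount to a gate in front of the rule library. The sketch below assumes the simplest possible pass criterion (no failed actions and no differing positions); the source only says the judgment is made from the verification result, so this criterion and the names used are hypothetical.

```python
def passes_verification(failed, diffs):
    """Assumed criterion: no failed actions and no snapshot differences."""
    return not failed and not diffs

def publish_if_verified(rule, failed, diffs, rule_library):
    """Append the rule to the library only when the verification run passed."""
    if passes_verification(failed, diffs):
        rule_library.append(rule)
        return True
    return False  # the user must maintain the rule and verify it again

rule_library = []
rule = [{"kind": "open", "url": "https://example.com"}]

rejected = publish_if_verified(rule, failed=[{"index": 0}], diffs=[], rule_library=rule_library)
accepted = publish_if_verified(rule, failed=[], diffs=[], rule_library=rule_library)
```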
In the embodiment of the present invention, parameter information of a web page element is obtained by a web page node parsing algorithm according to the element selected by the user; the configuration parameters required by the corresponding web page information extraction actions are then filled in according to the obtained parameter information; finally, corresponding operations are performed on the required actions in the preset visual rule action management area to generate the corresponding web page information extraction rule. This spares the user from analyzing the web page structure and lowers the expertise required of the user. It also provides the user, in the preset visual rule action management area, with easy-to-operate management of web page information extraction actions, which greatly reduces the difficulty of writing and maintaining extraction rules and improves the efficiency of constructing them.
Embodiment two
An embodiment of the present invention provides a device for constructing a visual web page information extraction rule, which applies the construction method described in Embodiment 1. Referring to Fig. 4, the device may include a first acquisition module 100, a processing module 200 and a first generation module 300.
The first acquisition module 100 is configured to obtain, according to the web page element selected by the user, parameter information of the web page element by means of a web page node parsing algorithm. The parameter information includes the xpath, the attributes and the text value of the web page element.
In this embodiment, the web page information that the user wants to extract is contained in the web page element. Xpath is the XML path language, a language for locating a particular part of an XML document. Xpath is based on the tree structure of XML and provides the ability to find nodes in the data structure tree. In practical applications, obtaining the xpath, attributes and text value of a web page element by means of a web page node parsing algorithm is prior art and is not described further here.
The processing module 200 is configured to fill in, according to the obtained web page element parameter information, the configuration parameters required by the corresponding web page information extraction actions.
In this embodiment, a web page information extraction rule is composed of multiple web page information extraction actions, each of which is an independent rule unit. The open action opens the target page, the click action triggers loading of the target element in the page, the wait action waits for the target element to finish loading, and the data action extracts the target information.
In this embodiment, each web page information extraction action must be configured with parameters according to the web page element to be extracted, so that the action can operate on that element.
Specifically, each web page information extraction action is provided in advance with an independent configuration page for filling in its configuration parameters.
In this embodiment, a complete web page information extraction rule may consist of multiple web page information extraction actions. Each kind of action has a different focus and needs different web page information, so its configuration page is also different. The configuration pages of different actions are independent of one another and not coupled, which facilitates later maintenance and extension.
The first generation module 300 is configured to perform, in the preset visual rule action management area, corresponding operations on the required web page information extraction actions to generate the corresponding web page information extraction rule.
In this embodiment, the user can add, insert, delete and edit web page information extraction actions in the visual rule action management area. Once multiple actions have been edited into a certain execution order, the corresponding web page information extraction rule has been generated.
Optionally, referring to Fig. 5, the device may further include a second generation module 400 and a second acquisition module 500.
The second generation module 400 is configured to run the generated web page information extraction rule and generate a corresponding execution log and page snapshot. The execution log records the execution result of each web page information extraction action in the rule.
In this embodiment, a web page information extraction rule generated by the visual construction device can be given a verification run, and the execution log and page snapshot produced by the verification run are used to judge whether the rule is qualified.
The second acquisition module 500 is configured to obtain a corresponding verification result according to a preset verification rule. The verification result is used by the user to maintain the web page information extraction rule.
Further, referring to Fig. 6, the second acquisition module 500 may include a first acquisition unit 501 and a second acquisition unit 502.
The first acquisition unit 501 is configured to parse the generated execution log and obtain the web page information extraction actions that failed to execute.
The second acquisition unit 502 is configured to compare the generated page snapshot with the original page by means of a preset image comparison technique and obtain the positions at which the page snapshot differs from the original page.
In this embodiment, if the execution log records an action that failed to execute, or the generated page snapshot differs from the original page, the user can use this information to maintain the actions in the rule that need to be changed. The maintained rule is then given another verification run, and this is repeated until it passes.
Optionally, referring to Fig. 5, the device may further include a judgment module 600 and a publishing module 700.
The judgment module 600 is configured to judge, according to the obtained verification result, whether the web page information extraction rule passes the verification run.
The publishing module 700 is configured to publish the rule to the preset rule library for batch extraction of page information when the rule passes the verification run.
In the embodiment of the present invention, parameter information of a web page element is obtained by a web page node parsing algorithm according to the element selected by the user; the configuration parameters required by the corresponding web page information extraction actions are then filled in according to the obtained parameter information; finally, corresponding operations are performed on the required actions in the preset visual rule action management area to generate the corresponding web page information extraction rule. This spares the user from analyzing the web page structure and lowers the expertise required of the user. It also provides the user, in the preset visual rule action management area, with easy-to-operate management of web page information extraction actions, which greatly reduces the difficulty of writing and maintaining extraction rules and improves the efficiency of constructing them.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
It should be noted that when the device for constructing a visual web page information extraction rule provided by the above embodiment implements the construction method, the division into the functional modules described above is used only as an example. In practical applications, the above functions may be allocated to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the construction device provided by the above embodiment and the construction method embodiment belong to the same concept. For the specific implementation process of the device, refer to the method embodiment; it is not repeated here.
A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented by hardware, or by a program instructing the related hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the scope of protection of the present invention.
Claims (10)
1. A method for constructing a visual web page information extraction rule, characterized in that the method comprises:
obtaining, according to a web page element selected by a user, parameter information of the web page element by means of a web page node parsing algorithm, the parameter information comprising the xpath, the attributes and the text value of the web page element;
filling in, according to the obtained web page element parameter information, the configuration parameters required by the corresponding web page information extraction actions;
performing, in a preset visual rule action management area, corresponding operations on the required web page information extraction actions to generate the corresponding web page information extraction rule.
2. The method according to claim 1, characterized by further comprising:
running the generated web page information extraction rule to generate a corresponding execution log and page snapshot, the execution log being used to record the execution result of each web page information extraction action in the rule;
obtaining a corresponding verification result according to a preset verification rule, the verification result being used by the user to maintain the web page information extraction rule.
3. The method according to claim 2, characterized in that obtaining the corresponding verification result according to the preset verification rule comprises:
parsing the generated execution log to obtain the web page information extraction actions that failed to execute;
comparing the generated page snapshot with the original page by means of a preset image comparison technique to obtain the positions at which the page snapshot differs from the original page.
4. The method according to claim 2, characterized by further comprising:
judging, according to the obtained verification result, whether the web page information extraction rule passes the verification run;
when the rule passes the verification run, publishing the rule to a preset rule library for batch extraction of page information.
5. The method according to any one of claims 1 to 4, characterized in that each web page information extraction action is provided in advance with an independent configuration page for filling in its configuration parameters.
6. A device for constructing a visual web page information extraction rule, characterized in that the device comprises:
a first acquisition module, configured to obtain, according to a web page element selected by a user, parameter information of the web page element by means of a web page node parsing algorithm, the parameter information comprising the xpath, the attributes and the text value of the web page element;
a processing module, configured to fill in, according to the obtained web page element parameter information, the configuration parameters required by the corresponding web page information extraction actions;
a first generation module, configured to perform, in a preset visual rule action management area, corresponding operations on the required web page information extraction actions to generate the corresponding web page information extraction rule.
7. The device according to claim 6, characterized by further comprising:
a second generation module, configured to run the generated web page information extraction rule and generate a corresponding execution log and page snapshot, the execution log being used to record the execution result of each web page information extraction action in the rule;
a second acquisition module, configured to obtain a corresponding verification result according to a preset verification rule, the verification result being used by the user to maintain the web page information extraction rule.
8. The device according to claim 7, characterized in that the second acquisition module comprises:
a first acquisition unit, configured to parse the generated execution log and obtain the web page information extraction actions that failed to execute;
a second acquisition unit, configured to compare the generated page snapshot with the original page by means of a preset image comparison technique and obtain the positions at which the page snapshot differs from the original page.
9. The device according to claim 7, characterized by further comprising:
a judgment module, configured to judge, according to the obtained verification result, whether the web page information extraction rule passes the verification run;
a publishing module, configured to publish the rule to a preset rule library for batch extraction of page information when the rule passes the verification run.
10. The device according to any one of claims 6 to 9, characterized in that each web page information extraction action is provided in advance with an independent configuration page for filling in its configuration parameters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610956895.5A CN106570133B (en) | 2016-10-27 | 2016-10-27 | A kind of construction method and device of visual webpage information extracting rule |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610956895.5A CN106570133B (en) | 2016-10-27 | 2016-10-27 | A kind of construction method and device of visual webpage information extracting rule |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106570133A (en) | 2017-04-19 |
CN106570133B (en) | 2019-07-23 |
Family
ID=58535373
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610956895.5A Active CN106570133B (en) | 2016-10-27 | 2016-10-27 | A kind of construction method and device of visual webpage information extracting rule |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106570133B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003308275A (en) * | 2002-04-12 | 2003-10-31 | Sharp Corp | System and method for extracting webpage information |
CN103714116A (en) * | 2013-10-31 | 2014-04-09 | 北京奇虎科技有限公司 | Webpage information extracting method and webpage information extracting equipment |
CN104050281A (en) * | 2014-06-26 | 2014-09-17 | 北京思特奇信息技术股份有限公司 | Webpage information extraction method and device based on http protocol |
CN105468730A (en) * | 2015-11-20 | 2016-04-06 | 广州华多网络科技有限公司 | Webpage information extraction method and equipment |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107729475A (en) * | 2017-10-16 | 2018-02-23 | 深圳视界信息技术有限公司 | Web page element acquisition method, device, terminal and computer-readable recording medium |
CN108874977A (en) * | 2018-06-08 | 2018-11-23 | 东软集团股份有限公司 | Page data extracting method, device, storage medium and electronic equipment |
CN108874977B (en) * | 2018-06-08 | 2020-11-27 | 东软集团股份有限公司 | Page data extraction method and device, storage medium and electronic equipment |
CN109657117A (en) * | 2018-11-12 | 2019-04-19 | 厦门市美亚柏科信息股份有限公司 | A kind of extraction method, system and the computer storage medium of webpage element |
CN112597377A (en) * | 2020-12-25 | 2021-04-02 | 北京百度网讯科技有限公司 | Information extraction module generation method, information extraction method and device |
Also Published As
Publication number | Publication date |
---|---|
CN106570133B (en) | 2019-07-23 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
JP6842167B2 (en) | Summary generator, summary generation method and computer program | |
He et al. | Duplicate bug report detection using dual-channel convolutional neural networks | |
CN104267947B (en) | A kind of editor's method of pop-up picture and pop-up picture editor's device | |
US7386558B2 (en) | Methods and systems for filtering an Extensible Application Markup Language (XAML) file to facilitate indexing of the logical content contained therein | |
US20130305149A1 (en) | Document reader and system for extraction of structural and semantic information from documents | |
CN101464905A (en) | Web page information extraction system and method | |
Huynh et al. | Enabling web browsers to augment web sites' filtering and sorting functionalities | |
Graliński et al. | Kleister: A novel task for information extraction involving long documents with complex layout | |
CN106570133A (en) | Method and device for constructing visual webpage information extracting rule | |
CA2698914A1 (en) | Document segmentation | |
JPH11143912A (en) | Related document display device | |
CN106960058A (en) | A kind of structure of web page alteration detection method and system | |
CN116860987A (en) | Domain knowledge graph construction method and system based on generation type large language model | |
Aumiller et al. | Online dateing: a web interface for temporal annotations | |
EP2599042A1 (en) | Systems and methods of rapid business discovery and transformation of business processes | |
Sanoja et al. | Block-o-matic: a web page segmentation tool and its evaluation | |
KR101104753B1 (en) | Extraction method for hierarchical structure in text contents of structural calculation document | |
Doulani et al. | Analysis of Iranian and British university websites by world wide web consortium. | |
CN107203525A (en) | The treating method and apparatus of database | |
CN110837614A (en) | Method and system for efficiently generating webpage information extraction rule | |
CN112328246A (en) | Page component generation method and device, computer equipment and storage medium | |
Maria et al. | MATESC: Metadata-Analytic Text Extractor and Section Classifier for Scientific Publications. | |
Sithole et al. | Attributes extraction for fine-grained differentiation of the Internet of Things patterns | |
CN113268412B (en) | Control analysis method, device, equipment and medium for Web system test case recording | |
JP2004303097A (en) | Partial document extraction program and partial document extraction method of structured document |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |