CN108255895A - A web data acquisition method using context environment rules - Google Patents
A web data acquisition method using context environment rules
- Publication number
- CN108255895A (application number CN201611271569.7A)
- Authority
- CN
- China
- Prior art keywords
- rule
- node
- contents extraction
- context
- web data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24564—Applying rules; Deductive queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2457—Query processing with adaptation to user needs
- G06F16/24575—Query processing with adaptation to user needs using context
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Abstract
The invention discloses a web data acquisition method using context environment rules, comprising content extraction rules and a rule matching algorithm. The content extraction rules are mainly defined by the user according to an extraction rule grammar and are organized in a tree-shaped inheritance structure. The extraction rule grammar follows a condition-action pattern: the condition part comprises DOM node attributes and context attributes, and the action part comprises classifying the nodes that match the condition, updating the context attributes, and applying a specific content extraction technique. By integrating several of the main data extraction techniques from the data mining field, the invention achieves more accurate web data extraction. The extraction rule grammar is simple to define, easy to learn and use, and efficient to write; rule matching conditions allow different extraction methods to be applied precisely within the same page; and the content extraction quality is higher than that of existing similar products.
Description
Technical field
The present invention relates to the field of data mining, and specifically to a web data acquisition method using context environment rules.
Background
Web page content acquisition is a complex process. It involves determining which parts of a page contain the core text and ignoring content unrelated to the main topic, such as headers, footers, navigation bars, and advertisements. The most critical of these steps is identifying the core text content. Core text identification has a wide range of applications, such as generating text indexes, generating web page summaries, providing page-reading functions for visually impaired users, and providing optimized web content for small-screen smart devices. In these applications, any irrelevant information left unfiltered in the page, even a very small amount, interferes with the user's reading. Products dedicated to extracting core web page content already exist, such as Lixto, Kapowtech, and Mozenda. These products use different extraction strategies: some use DOM tree methods, some use visual text block methods, and others use density methods. Each method suits different situations, and a single method used alone does not necessarily achieve an ideal extraction result on a particular page. It is therefore important to design a tool that integrates these extraction techniques and can determine which technique should be applied to each part of a web page.
Summary of the invention
The purpose of the present invention is to provide a web data acquisition method using context environment rules, in order to solve the problems raised in the background section above.
To achieve the above purpose, the present invention provides the following technical solution:
A web data acquisition method using context environment rules comprises content extraction rules and a rule matching algorithm. The content extraction rules are mainly defined by the user according to an extraction rule grammar and use a tree-shaped inheritance structure. The extraction rule grammar follows a condition-action pattern. The condition part comprises DOM node attributes and context attributes; the DOM node attributes include the tag name, node class name, node ID, node font name, node width, node height, and some values computed from the inside of the DOM node. The action part comprises classifying the nodes that match the condition, updating the context attributes, and applying a specific content extraction technique.
As a further embodiment of the present invention, the context attributes mainly include cSection, cBlock, cTitle, cFont, cTextColor, and cBackColor.
As a further embodiment of the present invention, the nodes can be divided into two classes: core content nodes and noise nodes.
Compared with the prior art, the beneficial effects of the present invention are as follows. The invention integrates several of the main data extraction techniques from the data mining field and, on this basis, introduces context attributes and node attribute functions to achieve more accurate web data extraction. The extraction rule grammar is simple to define, and extraction rules are implemented hierarchically. The method does not require the user to have computer-related technical expertise; it is easy to learn, easy to use, and efficient to write. Rule matching conditions allow different extraction methods to be applied precisely within the same page, and the content extraction quality is higher than that of existing similar products.
Description of the drawings
Fig. 1 shows pseudocode implementing the web data acquisition method using context environment rules.
Detailed description
The technical solution of the present invention is described in more detail below with reference to specific embodiments.
A web data acquisition method using context environment rules comprises content extraction rules and a rule matching algorithm. The content extraction rules are mainly defined by the user according to the extraction rule grammar. They use a tree-shaped inheritance structure similar to that of object-oriented languages: specific, specialized extraction rules are derived by inheriting from general rules. The extraction rule grammar follows a condition-action pattern. The condition part comprises DOM node attributes and context attributes. DOM node attributes include the tag name, node class name, node ID, node font name, node width, and node height, as well as values computed from the inside of the DOM node, such as the number of images it contains, the number of character strings, the text size, and the link density.
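The source names link density as a computed node value but does not define it. The following minimal sketch assumes one common definition, the share of a node's text characters that sit inside `<a>` elements. The class `LinkDensityParser` and the function `link_density` are hypothetical names introduced for illustration; only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Counts text characters inside and outside <a> elements (assumed definition)."""

    def __init__(self):
        super().__init__()
        self.total_chars = 0   # all non-whitespace-trimmed text characters
        self.link_chars = 0    # text characters found inside <a> ... </a>
        self._a_depth = 0      # how many <a> elements are currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._a_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._a_depth > 0:
            self._a_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self._a_depth > 0:
            self.link_chars += n


def link_density(html_fragment: str) -> float:
    """Return link text length divided by total text length (0.0 for empty text)."""
    parser = LinkDensityParser()
    parser.feed(html_fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars
```

Under this assumed definition, a navigation bar made up only of links scores close to 1.0 and a plain paragraph close to 0.0, which is why such a value could plausibly serve as a node condition in a rule.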
Context attributes describe the environment in which a node is located; they mainly include cSection, cBlock, cTitle, cFont, cTextColor, and cBackColor. The action part comprises classifying the nodes that match the condition, updating the context attributes, and applying a specific content extraction technique. Nodes that match a condition fall into two classes: core content nodes and noise nodes. The extraction rule grammar is formally defined as follows:
NodeClass, (Action1, Action2, ...) ← (Context1, Context2, ...) (NodeProp1, NodeProp2, ...)
NodeClass is the node classification
Actioni is a concrete action
Contexti is a context attribute
NodePropi is a DOM node attribute
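One way to picture a rule of this form is as a small data structure. The sketch below is an illustration under stated assumptions, not the patent's implementation: the class `Rule`, its field names, and the choice to model inheritance as "a specialized rule must also satisfy its parent's conditions" are all hypothetical, since the source does not spell out how inherited rules combine.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

# A condition is any function from an attribute dictionary to True/False.
Predicate = Callable[[Dict[str, Any]], bool]


@dataclass
class Rule:
    """NodeClass, (Action1, ...) <- (Context1, ...) (NodeProp1, ...)"""
    node_class: str                                   # e.g. "core" or "noise"
    actions: List[str]                                # names of actions to run
    context_conds: List[Predicate] = field(default_factory=list)
    node_conds: List[Predicate] = field(default_factory=list)
    parent: Optional["Rule"] = None                   # general rule being specialized

    def _inherited(self, name: str) -> List[Predicate]:
        # Assumption: conditions of the general rule are checked before our own.
        base = self.parent._inherited(name) if self.parent else []
        return base + getattr(self, name)

    def matches(self, node_props: Dict[str, Any], context: Dict[str, Any]) -> bool:
        return (all(c(context) for c in self._inherited("context_conds"))
                and all(p(node_props) for p in self._inherited("node_conds")))


# A general rule: link-heavy nodes are noise (threshold 0.5 is illustrative).
general_noise = Rule("noise", ["discard"],
                     node_conds=[lambda n: n.get("link_density", 0.0) > 0.5])

# A specialized rule that additionally requires the footer section context.
footer_links = Rule("noise", ["discard"],
                    context_conds=[lambda c: c.get("cSection") == "footer"],
                    parent=general_noise)
```

Keeping conditions as plain predicates means a specialized rule only states what it adds, which mirrors the tree-shaped inheritance the description attributes to the rule set.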
After the user has defined the content extraction rules according to the above grammar, the rules serve as input to the rule matching algorithm. The algorithm traverses the nodes of the web page's DOM tree and checks each node against the content extraction rules. A node classified as noise is discarded and the corresponding context attribute update is performed; if the content of the noise node was referenced when its parent node executed a matching rule, that reference in the parent node is deleted as well. For a node classified as core content, the corresponding context attributes are updated and the predefined content extraction method is executed on the node. After all nodes in the DOM tree have been traversed, the rule matching algorithm outputs the extracted core text content of the web page as its result.
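The traversal described above can be sketched roughly as follows. This is a simplified illustration, not the pseudocode of Fig. 1: the `Node` class, the tuple layout of a rule, and the function `extract` are hypothetical. It also makes three assumptions the source does not state: context updates apply only to a node's descendants, a noise node is dropped together with its subtree, and the "content extraction method" for a core node is simply collecting its own text. Deleting references from the parent node is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Node:
    """A minimal stand-in for a DOM node."""
    tag: str
    text: str = ""
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


Context = Dict[str, str]
# (node class, condition on node and context, context attributes to update)
Rule = Tuple[str, Callable[[Node, Context], bool], Context]


def extract(node: Node, rules: List[Rule],
            context: Optional[Context] = None,
            out: Optional[List[str]] = None) -> List[str]:
    context = dict(context or {})        # copy: updates reach descendants only
    out = [] if out is None else out
    for node_class, cond, update in rules:
        if cond(node, context):          # first matching rule wins
            if node_class == "noise":
                return out               # discard the node and its subtree
            context.update(update)       # core node: update context attributes
            if node.text:
                out.append(node.text)    # stand-in for the extraction method
            break
    for child in node.children:
        extract(child, rules, context, out)
    return out


page = Node("body", children=[
    Node("div", attrs={"class": "nav"}, children=[Node("a", "Home")]),
    Node("div", attrs={"class": "article"}, children=[
        Node("p", "First paragraph."),
        Node("div", attrs={"class": "ad"}, children=[Node("p", "Buy now")]),
        Node("p", "Second paragraph."),
    ]),
])

rules: List[Rule] = [
    ("noise", lambda n, c: n.attrs.get("class") in {"nav", "ad"}, {}),
    ("core", lambda n, c: n.attrs.get("class") == "article", {"cSection": "article"}),
    ("core", lambda n, c: n.tag == "p" and c.get("cSection") == "article", {}),
]

result = extract(page, rules)
```

In this example the third rule depends on the `cSection` context set by the second, so the same `<p>` tag is extracted inside the article but ignored elsewhere, which is the kind of context-dependent matching the description aims at.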
It is obvious to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments, and that the invention can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments are therefore to be regarded in every respect as illustrative and not restrictive. The scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalents of the claims are intended to be included in the invention.
In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way merely for the sake of clarity. Those skilled in the art should take the specification as a whole; the technical solutions in the embodiments may also be suitably combined to form other embodiments that those skilled in the art can understand.
Claims (3)
1. A web data acquisition method using context environment rules, characterized in that it comprises content extraction rules and a rule matching algorithm; the content extraction rules are mainly defined by the user according to an extraction rule grammar and use a tree-shaped inheritance structure; the extraction rule grammar follows a condition-action pattern; the condition part comprises DOM node attributes and context attributes, the DOM node attributes including the tag name, node class name, node ID, node font name, node width, node height, and some values computed from the inside of the DOM node; and the action part comprises classifying the nodes that match the condition, updating the context attributes, and applying a specific content extraction technique.
2. The web data acquisition method using context environment rules according to claim 1, characterized in that the context attributes mainly include cSection, cBlock, cTitle, cFont, cTextColor, and cBackColor.
3. The web data acquisition method using context environment rules according to claim 1, characterized in that the nodes can be divided into two classes: core content nodes and noise nodes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611271569.7A CN108255895A (en) | 2016-12-29 | 2016-12-29 | A web data acquisition method using context environment rules |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611271569.7A CN108255895A (en) | 2016-12-29 | 2016-12-29 | A web data acquisition method using context environment rules |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108255895A true CN108255895A (en) | 2018-07-06 |
Family
ID=62722053
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611271569.7A Pending CN108255895A (en) | A web data acquisition method using context environment rules | 2016-12-29 | 2016-12-29 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108255895A (en) |
- 2016-12-29: Application CN201611271569.7A filed in China (CN), published as CN108255895A; status: Pending
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111061955A (en) * | 2019-12-20 | 2020-04-24 | 深圳市朱墨科技有限公司 | Webpage text extraction method, device, server and storage medium |
CN111061955B (en) * | 2019-12-20 | 2023-11-07 | 深圳市朱墨科技有限公司 | Webpage text extraction method and device, server and storage medium |
CN111931097A (en) * | 2020-09-24 | 2020-11-13 | 腾讯科技(深圳)有限公司 | Information display method and device, electronic equipment and storage medium |
CN111931097B (en) * | 2020-09-24 | 2021-01-05 | 腾讯科技(深圳)有限公司 | Information display method and device, electronic equipment and storage medium |
CN114254068A (en) * | 2022-02-28 | 2022-03-29 | 杭州未名信科科技有限公司 | Data transfer method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102253979B (en) | Vision-based web page extracting method | |
CN102156737B (en) | Method for extracting subject content of Chinese webpage | |
WO2019041521A1 (en) | Apparatus and method for extracting user keyword, and computer-readable storage medium | |
CN104408137B (en) | A kind of network statistics map visualization data preparation method | |
CN107590219A (en) | Webpage personage subject correlation message extracting method | |
CN102122280B (en) | Method and system for intelligently extracting content object | |
CN104268160A (en) | Evaluation object extraction method based on domain dictionary and semantic roles | |
CN102298638A (en) | Method and system for extracting news webpage contents by clustering webpage labels | |
CN106557565A (en) | A kind of text message extracting method based on website construction | |
CN103294664A (en) | Method and system for discovering new words in open fields | |
CN106446072A (en) | Webpage content processing method and apparatus | |
CN103123624A (en) | Method of confirming head word, device of confirming head word, searching method and device | |
CN108255895A (en) | A kind of web data acquisition methods using context environmental rule | |
CN108804469A (en) | A kind of web page identification method and electronic equipment | |
CN104123336B (en) | Depth Boltzmann machine model and short text subject classification system and method | |
CN107145591B (en) | Title-based webpage effective metadata content extraction method | |
CN106528068A (en) | Webpage content reconstruction method and system | |
CN108959204A (en) | Internet monetary items information extraction method and system | |
JP2013033473A5 (en) | ||
Eldirdiery et al. | Detecting and removing noisy data on web document using text density approach | |
Palekar et al. | Deep web data extraction using web-programming-language-independent approach | |
Furche et al. | Amber: Automatic supervision for multi-attribute extraction | |
CN107451215A (en) | Feature text abstracting method and device | |
CN107145947A (en) | A kind of information processing method, device and electronic equipment | |
CN102982029A (en) | Identification method and device for searching requirement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20180706 |