CN108255895A - A web data acquisition method using context environment rules - Google Patents

A web data acquisition method using context environment rules

Info

Publication number
CN108255895A
CN108255895A CN201611271569.7A CN201611271569A CN108255895A CN 108255895 A CN108255895 A CN 108255895A CN 201611271569 A CN201611271569 A CN 201611271569A CN 108255895 A CN108255895 A CN 108255895A
Authority
CN
China
Prior art keywords
rule
node
contents extraction
context
web data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611271569.7A
Other languages
Chinese (zh)
Inventor
孙翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201611271569.7A
Publication of CN108255895A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24564 Applying rules; Deductive queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24575 Query processing with adaptation to user needs using context
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Abstract

The invention discloses a web data acquisition method using context environment rules, comprising content extraction rules and a rule matching algorithm. The content extraction rules are defined mainly by the user according to an extraction rule grammar and are organized in a tree-shaped inheritance structure. The extraction rule grammar follows a condition-action pattern: the condition part comprises DOM node attributes and context attributes, and the action part comprises classifying the nodes that match the condition, updating the context attributes, and applying a specific content extraction technique. By combining several main data extraction techniques from the data mining field, the invention achieves more accurate web data extraction. The extraction rule grammar is simple to define, easy to learn, convenient to use, and efficient to write. Rule matching conditions allow different extraction methods to be applied precisely within the same page, and the content extraction quality is higher than that of existing similar products.

Description

A web data acquisition method using context environment rules
Technical field
The present invention relates to the field of data mining, and specifically to a web data acquisition method using context environment rules.
Background
Web page content acquisition is a complicated process. It involves determining which parts of a page contain the core text and ignoring content unrelated to the topic, such as headers, footers, navigation bars, and advertisements. The most critical of these steps is identifying the core text content. Core text identification has a wide range of applications, such as generating text indexes, generating web page summaries, providing web page reading functions for visually impaired users, and providing optimized web content for small-screen smart devices. In these applications, even a very small amount of unfiltered irrelevant information in a web page can interfere with the user's reading. Products dedicated to extracting core web page content already exist, such as Lixto, Kapowtech, and Mozenda. These products use different extraction strategies: some use DOM tree methods, some use visual text block methods, and others use density methods. Each method has its own applicable scenarios, and using a single method alone does not necessarily achieve an ideal content extraction result on a particular page. It is therefore important to design a tool that integrates these different extraction techniques and can determine which technique is best suited to each part of a web page.
Summary of the invention
The purpose of the present invention is to provide a web data acquisition method using context environment rules, so as to solve the problems described in the background above.
To achieve the above purpose, the present invention provides the following technical solution:
A web data acquisition method using context environment rules comprises content extraction rules and a rule matching algorithm. The content extraction rules are defined mainly by the user according to an extraction rule grammar and are organized in a tree-shaped inheritance structure. The extraction rule grammar follows a condition-action pattern. The condition part comprises DOM node attributes and context attributes; the DOM node attributes include the tag name, node class name, node ID, node font name, node width attribute, node height attribute, and some values computed from the inside of the DOM node. The action part comprises classifying the nodes that match the condition, updating the context attributes, and applying a specific content extraction technique.
As a further embodiment of the present invention, the context attributes mainly include cSection, cBlock, cTitle, cFont, cTextColor, and cBackColor.
As a further embodiment of the present invention, the nodes can be divided into two classes: core content nodes and noise nodes.
Compared with the prior art, the beneficial effects of the present invention are as follows. The invention combines several main data extraction techniques from the data mining field and, on this basis, introduces context attributes and node attributes to achieve more accurate web data extraction. The extraction rule grammar is simple to define, and the extraction rules are implemented hierarchically and do not require the user to be familiar with computer-related technologies; they are easy to learn, convenient to use, and efficient to write. Rule matching conditions allow different extraction methods to be applied precisely within the same page, and the content extraction quality is higher than that of existing similar products.
Brief description of the drawings
Fig. 1 is a pseudocode diagram of an implementation of the web data acquisition method using context environment rules.
Detailed description
The technical solution of the present invention is described in more detail below with reference to a specific embodiment.
A web data acquisition method using context environment rules comprises content extraction rules and a rule matching algorithm. The content extraction rules are defined mainly by the user according to an extraction rule grammar. The rules are organized in a tree-shaped inheritance structure similar to that of object-oriented languages: specific, specialized extraction rules are derived by inheriting from general rules. The extraction rule grammar follows a condition-action pattern. The condition part comprises DOM node attributes and context attributes. The DOM node attributes include the tag name, node class name, node ID, node font name, node width attribute, and node height attribute, as well as some values computed from the inside of the DOM node, such as the number of images it contains, the number of character strings, the text length, and the link density. Context attributes describe the environment in which a node is located; they mainly include cSection, cBlock, cTitle, cFont, cTextColor, and cBackColor. The action part comprises classifying the nodes that match the condition, updating the context attributes, and applying a specific content extraction technique. Nodes that match a condition can be divided into two classes: core content nodes and noise nodes. The formal definition of the extraction rule grammar is as follows:
NodeClass, (Action1, Action2, ...) ← (Context1, Context2, ...) (NodeProp1, NodeProp2, ...)
NodeClass is the node classification.
Actioni (i = 1, 2, ...) is a concrete action.
Contexti (i = 1, 2, ...) is a context attribute.
NodePropi (i = 1, 2, ...) is a DOM node attribute.
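The grammar above can be sketched as a small data structure. The following Python sketch is only an illustration, not the representation used by the invention: the names `Rule`, `conditions`, and `matches`, the use of predicate functions for conditions, the `"noise"` label string, the action strings, and the 0.6 link density threshold are all assumptions made for this example. It shows a rule with a node classification, a list of actions, context conditions, and DOM node attribute conditions, and it models the tree-shaped inheritance structure by letting a specific rule inherit the conditions of a more general parent rule.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# A condition is a predicate over one attribute value (assumption of this sketch).
Predicate = Callable[[object], bool]
Conditions = Dict[str, Predicate]


@dataclass
class Rule:
    """NodeClass, (actions) <- (context conditions) (node attribute conditions)."""
    name: str
    node_class: str                                   # hypothetical labels: "core" / "noise"
    actions: List[str] = field(default_factory=list)  # action names, illustrative only
    context: Conditions = field(default_factory=dict)
    node_props: Conditions = field(default_factory=dict)
    parent: Optional["Rule"] = None                   # general rule this rule inherits from

    def conditions(self) -> Tuple[Conditions, Conditions]:
        """Merge inherited conditions; the specific rule overrides its ancestors."""
        if self.parent is None:
            ctx, props = {}, {}
        else:
            ctx, props = self.parent.conditions()
        return {**ctx, **self.context}, {**props, **self.node_props}

    def matches(self, node: Dict[str, object], ctx: Dict[str, object]) -> bool:
        """True when every context and node attribute condition holds."""
        ctx_conds, prop_conds = self.conditions()
        return (all(k in ctx and p(ctx[k]) for k, p in ctx_conds.items())
                and all(k in node and p(node[k]) for k, p in prop_conds.items()))


# General rule: any node dominated by links is noise (threshold is hypothetical).
link_heavy = Rule("link_heavy", "noise",
                  node_props={"link_density": lambda d: d > 0.6})

# Specialized rule: inherits the link density condition and adds its own.
sidebar_links = Rule("sidebar_links", "noise",
                     actions=["set cBlock"],
                     context={"cSection": lambda s: s == "sidebar"},
                     node_props={"tag": lambda t: t == "div"},
                     parent=link_heavy)

node = {"tag": "div", "link_density": 0.8}
in_sidebar = sidebar_links.matches(node, {"cSection": "sidebar"})
in_main = sidebar_links.matches(node, {"cSection": "main"})
```

In this sketch the specialized rule matches only when both its own conditions and the inherited link density condition hold, which is one plausible reading of "special rules are inherited from general rules".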
After the user has defined the content extraction rules using the above grammar, the rules serve as input to the rule matching algorithm. The algorithm traverses the nodes of the web page's DOM tree and checks each node against the content extraction rules. A node classified as a noise node is discarded and the corresponding context attribute updates are performed; if the content of the noise node is referenced in its parent node, that reference in the parent node is deleted when the matching rule is applied. For a node classified as a core content node, the corresponding context attributes are updated and the predefined content extraction method is executed on the node. After all nodes in the DOM tree have been traversed, the rule matching algorithm outputs the extracted core text content of the web page as the result.
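The traversal described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pseudocode of Fig. 1: the `Node` and `Rule` classes and the `extract_core_text` function are hypothetical, the DOM tree is built by hand rather than parsed from HTML, the first matching rule wins, and the context is modeled as one shared dictionary that is updated in document order. Noise nodes are dropped from their parent's child list (deleting the parent's reference to them), and core content nodes have their extraction function applied.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """Simplified stand-in for a DOM node (hypothetical)."""
    tag: str
    text: str = ""
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


@dataclass
class Rule:
    node_class: str                                     # "core" or "noise"
    condition: Callable[[Node, Dict[str, str]], bool]   # (node, context) -> matched?
    context_updates: Dict[str, str] = field(default_factory=dict)
    extract: Optional[Callable[[Node], str]] = None     # extraction method for core nodes


def extract_core_text(root: Node, rules: List[Rule]) -> List[str]:
    context: Dict[str, str] = {}   # shared context attributes (assumption)
    results: List[str] = []

    def visit(node: Node) -> bool:
        """Process a node; return False if it is noise and must be dropped."""
        rule = next((r for r in rules if r.condition(node, context)), None)
        if rule is not None:
            context.update(rule.context_updates)        # update context attributes
            if rule.node_class == "noise":
                return False                            # discard node and its subtree
        # Keep only non-noise children, which removes references to noise nodes.
        node.children = [child for child in node.children if visit(child)]
        if rule is not None and rule.extract is not None:
            results.append(rule.extract(node))          # run the extraction method
        return True

    visit(root)
    return results


rules = [
    Rule("noise", lambda n, c: n.attrs.get("id") == "nav"),
    Rule("noise", lambda n, c: n.attrs.get("id") == "footer", {"cSection": "footer"}),
    Rule("core", lambda n, c: n.attrs.get("id") == "main", {"cSection": "main"}),
    Rule("noise", lambda n, c: n.attrs.get("class") == "ad"),
    Rule("core", lambda n, c: n.tag == "p" and c.get("cSection") == "main",
         extract=lambda n: n.text),
]

main = Node("div", attrs={"id": "main"}, children=[
    Node("p", "Hello"),
    Node("div", attrs={"class": "ad"}, children=[Node("p", "Buy now")]),
    Node("p", "World"),
])
page = Node("html", children=[
    Node("div", attrs={"id": "nav"}, children=[Node("a", "Home")]),
    Node("p", "Stray teaser"),
    main,
    Node("div", attrs={"id": "footer"}, children=[Node("p", "Copyright")]),
    Node("p", "Trailing note"),
])

result = extract_core_text(page, rules)
print(result)
```

In this example the paragraphs inside the `main` section are extracted, while the navigation bar, the advertisement, and the footer are discarded. The paragraph before `main` is skipped because `cSection` is not yet set, and the paragraph after the footer is skipped because the footer rule, although it classifies a noise node, still updates `cSection`.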
It will be apparent to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiment, and that the invention can be implemented in other specific forms without departing from its spirit or essential characteristics. The embodiment should therefore be regarded in all respects as illustrative and not restrictive. The scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalency of the claims are intended to be embraced within the invention.
In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way merely for the sake of clarity. Those skilled in the art should consider the specification as a whole; the technical solutions in the embodiments may be combined as appropriate to form other embodiments that those skilled in the art can understand.

Claims (3)

1. A web data acquisition method using context environment rules, characterized in that it comprises content extraction rules and a rule matching algorithm; the content extraction rules are defined mainly by the user according to an extraction rule grammar and are organized in a tree-shaped inheritance structure; the extraction rule grammar follows a condition-action pattern; the condition part comprises DOM node attributes and context attributes, the DOM node attributes including the tag name, node class name, node ID, node font name, node width attribute, node height attribute, and some values computed from the inside of the DOM node; and the action part comprises classifying the nodes that match the condition, updating the context attributes, and applying a specific content extraction technique.
2. The web data acquisition method using context environment rules according to claim 1, characterized in that the context attributes mainly include cSection, cBlock, cTitle, cFont, cTextColor, and cBackColor.
3. The web data acquisition method using context environment rules according to claim 1, characterized in that the nodes can be divided into two classes: core content nodes and noise nodes.
CN201611271569.7A 2016-12-29 2016-12-29 A web data acquisition method using context environment rules Pending CN108255895A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611271569.7A CN108255895A (en) 2016-12-29 2016-12-29 A web data acquisition method using context environment rules

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611271569.7A CN108255895A (en) 2016-12-29 2016-12-29 A web data acquisition method using context environment rules

Publications (1)

Publication Number Publication Date
CN108255895A (en) 2018-07-06

Family

ID=62722053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611271569.7A Pending CN108255895A (en) 2016-12-29 2016-12-29 A web data acquisition method using context environment rules

Country Status (1)

Country Link
CN (1) CN108255895A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061955A (en) * 2019-12-20 2020-04-24 深圳市朱墨科技有限公司 Webpage text extraction method, device, server and storage medium
CN111061955B (en) * 2019-12-20 2023-11-07 深圳市朱墨科技有限公司 Webpage text extraction method and device, server and storage medium
CN111931097A (en) * 2020-09-24 2020-11-13 腾讯科技(深圳)有限公司 Information display method and device, electronic equipment and storage medium
CN111931097B (en) * 2020-09-24 2021-01-05 腾讯科技(深圳)有限公司 Information display method and device, electronic equipment and storage medium
CN114254068A (en) * 2022-02-28 2022-03-29 杭州未名信科科技有限公司 Data transfer method and system

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180706