CN108255895A

CN108255895A - A web page data acquisition method using context environment rules

Info

Publication number: CN108255895A
Application number: CN201611271569.7A
Authority: CN
Inventors: 孙翔
Original assignee: Individual
Current assignee: Individual
Priority date: 2016-12-29
Filing date: 2016-12-29
Publication date: 2018-07-06

Abstract

The invention discloses a web page data acquisition method using context environment rules, comprising content extraction rules and a rule matching algorithm. The content extraction rules are defined mainly by the user according to an extraction rule grammar, and are organized in a tree-shaped inheritance structure. The extraction rule grammar follows a condition-action pattern: the condition part comprises DOM node attributes and context attributes, and the action part comprises classifying the nodes that match the condition, updating context attributes, and applying a specific content extraction technique. The invention combines several of the main data extraction techniques from the data mining field and on that basis achieves more accurate web page data extraction. The extraction rule grammar is simple to define, easy to learn and use, and efficient to write. Rule matching conditions allow different extraction methods to be applied precisely to different parts of the same page, and the content extraction quality is higher than that of existing comparable products.

Description

A web page data acquisition method using context environment rules

Technical field

The present invention relates to the field of data mining, and specifically to a web page data acquisition method using context environment rules.

Background art

Web page content acquisition is a complicated process. It involves determining which parts of the page contain the core text content and ignoring content unrelated to the topic, such as headers, footers, navigation bars and advertisements. The most critical of these steps is identifying the core text content. Core text identification has a wide range of applications, such as generating text indexes, generating web page summaries, providing page-reading functions for visually impaired users, and providing optimized page content for small-screen smart devices. In these applications, even a very small amount of irrelevant information left unfiltered in the page can interfere with the user's reading. Products dedicated to extracting core web page content already exist, such as Lixto, Kapowtech and Mozenda. The extraction strategies these products use differ: some use the DOM tree method, some use the visual text block method, and others use the density method. Each of these methods has its own applicable scenarios, and using a single method does not necessarily achieve an ideal content extraction result on a particular page. It is therefore important to design a tool that integrates these different extraction techniques and can determine which technique is best applied to each part of a web page.

Summary of the invention

The purpose of the present invention is to provide a web page data acquisition method using context environment rules, so as to solve the problems raised in the background art above.

To achieve the above object, the present invention provides the following technical solution:

A web page data acquisition method using context environment rules comprises content extraction rules and a rule matching algorithm. The content extraction rules are defined mainly by the user according to an extraction rule grammar, and are organized in a tree-shaped inheritance structure. The extraction rule grammar follows a condition-action pattern. The condition part comprises DOM node attributes and context attributes; the DOM node attributes include the tag name, node class name, node ID, node font name, node width attribute, node height attribute, and some values computed from the inside of the DOM node. The action part comprises classifying the nodes that match the condition, updating context attributes, and applying a specific content extraction technique.

As a further aspect of the present invention: the context attributes mainly comprise cSection, cBlock, cTitle, cFont, cTextColor and cBackColor.

As a further aspect of the present invention: the nodes can be divided into two classes, core content nodes and noise nodes.

Compared with the prior art, the beneficial effects of the present invention are as follows. The invention combines several of the main data extraction techniques from the data mining field and, on that basis, introduces context attributes and node attribute functions to achieve more accurate web page data extraction. The extraction rule grammar is simple to define, and the extraction rules are implemented hierarchically. The method does not require the user to have computer-related technical expertise; it is easy to learn and use, and rules are efficient to write. Rule matching conditions allow different extraction methods to be applied precisely to different parts of the same page, and the content extraction quality is higher than that of existing comparable products.

Brief description of the drawings

Fig. 1 shows pseudocode for an implementation of the web page data acquisition method using context environment rules.

Detailed description of the embodiments

The technical solution of the present invention is described in more detail below with reference to specific embodiments.

A web page data acquisition method using context environment rules comprises content extraction rules and a rule matching algorithm. The content extraction rules are defined mainly by the user according to an extraction rule grammar. The content extraction rules are organized in a tree-shaped inheritance structure similar to that of object-oriented languages: specific, specialized extraction rules are derived by inheriting from general rules. The extraction rule grammar follows a condition-action pattern. The condition part comprises DOM node attributes and context attributes. The DOM node attributes include the tag name, node class name, node ID, node font name, node width attribute and node height attribute, as well as some values computed from the inside of the DOM node, such as the number of images it contains, the number of character strings, the text length and the link density. Context attributes describe the environment in which a node is located; they mainly comprise cSection, cBlock, cTitle, cFont, cTextColor and cBackColor. The action part comprises classifying the nodes that match the condition, updating context attributes, and applying a specific content extraction technique. Nodes that match a given condition can be divided into two classes, core content nodes and noise nodes. The formal definition of the extraction rule grammar is as follows:

NodeClass, (Action1, Action2, ...) ← (Context1, Context2, ...) (NodeProp1, NodeProp2, ...)

NodeClass is the node classification

Action_i is a specific action

Context_i is a context attribute

NodeProp_i is a DOM node attribute
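The rule form above, together with the inheritance of specialized rules from general ones, can be illustrated with a minimal Python sketch. The patent does not specify a concrete encoding, so everything here is an assumption made for illustration: the `Rule` class, the use of predicates for conditions, the `link_density` property key, the `"discard"` action name, and the choice that a child rule adds to (and may override) the conditions of its parent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

# Hypothetical encoding: each condition is a predicate over one attribute value.
Predicate = Callable[[object], bool]


@dataclass
class Rule:
    node_class: str                                   # NodeClass: "core" or "noise"
    actions: Tuple[str, ...] = ()                     # Action1, Action2, ...
    context: Dict[str, Predicate] = field(default_factory=dict)     # Context_i
    node_props: Dict[str, Predicate] = field(default_factory=dict)  # NodeProp_i
    parent: Optional["Rule"] = None                   # general rule inherited from

    def conditions(self):
        """Merge inherited conditions; the specialized rule overrides its parent."""
        ctx, props = ({}, {}) if self.parent is None else self.parent.conditions()
        return {**ctx, **self.context}, {**props, **self.node_props}

    def matches(self, context: dict, node_props: dict) -> bool:
        """True when every context and node-attribute condition holds."""
        ctx, props = self.conditions()
        return (all(k in context and p(context[k]) for k, p in ctx.items())
                and all(k in node_props and p(node_props[k]) for k, p in props.items()))


# General rule: link-heavy nodes are noise (the threshold is an assumed value).
general = Rule("noise", ("discard",),
               node_props={"link_density": lambda d: d > 0.5})

# Specialized rule: inherits the link-density test and adds a context condition.
sidebar = Rule("noise", ("discard",),
               context={"cSection": lambda s: s == "sidebar"},
               parent=general)
```

In this sketch, `sidebar` matches only when both the inherited node-attribute condition and its own context condition hold, which is one plausible reading of "specialized rules inherit from general rules".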

After the user has defined the content extraction rules according to the above grammar, the rules are supplied as input to the rule matching algorithm, which traverses the nodes of the web page's DOM tree and checks each node against the content extraction rules. A node classified as a noise node is discarded and the corresponding context attribute update is performed; if the content of the noise node is referenced in its parent node when the matching rule is executed, that reference in the parent node is deleted as well. For a node classified as a core content node, the corresponding context attributes are updated and the predefined content extraction method is executed on the node. After all nodes in the DOM tree have been traversed, the rule matching algorithm outputs the extracted core text content of the web page as the result.
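The traversal described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patented algorithm or the pseudocode of Fig. 1: the `Node` and `TraversalRule` types are hypothetical, the first matching rule wins, context is assumed to flow from a node down to its descendants, and "discarding" a noise node is modeled by skipping its whole subtree so that none of its content reaches the parent's output.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node."""
    tag: str
    text: str = ""
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def keep_context(node: Node, context: dict) -> dict:
    return context


def own_text(node: Node) -> str:
    return node.text


@dataclass
class TraversalRule:
    node_class: str                                   # "core" or "noise"
    condition: Callable[[Node, dict], bool]
    update_context: Callable[[Node, dict], dict] = keep_context
    extract: Callable[[Node], str] = own_text         # assumed extraction method


def extract_content(root: Node, rules: List[TraversalRule]) -> List[str]:
    output: List[str] = []

    def visit(node: Node, context: dict) -> None:
        for rule in rules:
            if rule.condition(node, context):
                context = rule.update_context(node, context)
                if rule.node_class == "noise":
                    return                            # discard node and subtree
                text = rule.extract(node)
                if text:
                    output.append(text)
                break                                 # first matching rule wins
        for child in node.children:
            visit(child, context)

    visit(root, {})
    return output


page = Node("body", children=[
    Node("div", attrs={"class": "nav"}, children=[Node("a", text="Home")]),
    Node("div", attrs={"class": "article"}, children=[
        Node("h1", text="Title"),
        Node("p", text="Body text."),
    ]),
    Node("div", text="Copyright", attrs={"class": "footer"}),
])

rules = [
    TraversalRule("noise", lambda n, c: n.attrs.get("class") in {"nav", "footer"}),
    TraversalRule("core", lambda n, c: n.attrs.get("class") == "article",
                  update_context=lambda n, c: {**c, "cSection": "article"}),
    TraversalRule("core",
                  lambda n, c: c.get("cSection") == "article" and n.tag in {"h1", "p"}),
]

result = extract_content(page, rules)
```

Here the third rule depends on the `cSection` context attribute set by the second, which shows how a context update on an ancestor can change how its descendants are matched.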

It will be apparent to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments, and that the invention can be implemented in other specific forms without departing from its spirit or essential characteristics. The embodiments are therefore to be regarded in all respects as illustrative and not restrictive. The scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalents of the claims are intended to be embraced by the invention.

In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution. The specification is written this way merely for the sake of clarity. Those skilled in the art should consider the specification as a whole; the technical solutions of the individual embodiments may also be combined as appropriate to form other embodiments that those skilled in the art can understand.

Claims

1. A web page data acquisition method using context environment rules, characterized in that it comprises content extraction rules and a rule matching algorithm; the content extraction rules are defined mainly by the user according to an extraction rule grammar; the content extraction rules are organized in a tree-shaped inheritance structure; the extraction rule grammar follows a condition-action pattern; the condition part comprises DOM node attributes and context attributes; the DOM node attributes include the tag name, node class name, node ID, node font name, node width attribute, node height attribute, and some values computed from the inside of the DOM node; and the action part comprises classifying the nodes that match the condition, updating context attributes, and applying a specific content extraction technique.

2. The web page data acquisition method using context environment rules according to claim 1, characterized in that the context attributes mainly comprise cSection, cBlock, cTitle, cFont, cTextColor and cBackColor.

3. The web page data acquisition method using context environment rules according to claim 1, characterized in that the nodes can be divided into two classes, core content nodes and noise nodes.