CN105512107A

CN105512107A - Internet regular text page title identification method based on vision

Info

Publication number: CN105512107A
Application number: CN201510918241.9A
Authority: CN
Inventors: 李天与; 杨伟锋
Original assignee: TIANJIN HYLANDA INFORMATION TECHNOLOGY CO LTD
Current assignee: TIANJIN HYLANDA INFORMATION TECHNOLOGY CO LTD
Priority date: 2015-12-10
Filing date: 2015-12-10
Publication date: 2016-04-20

Abstract

The invention provides a vision-based method for identifying the title of an Internet regular text page. The input to the method is the DOM (Document Object Model) tree object obtained after the page has been downloaded and rendered by the Chrome kernel. Because the DOM tree contains the rendered style information of the web page elements, it is convenient to analyze. Each HTML (HyperText Markup Language) element undergoes a necessity-weight judgment and a proportion-weight calculation, from which the title of the regular text page is obtained; the output is the Element object most likely to be the page title. In operation the method simulates the way a human recognizes a title, so that title elements in Internet regular text pages can be identified and distinguished efficiently and accurately.

Description

Vision-based method for identifying the title of an Internet text page

Technical field

The present invention relates to the technical field of Internet information acquisition, and specifically to a vision-based method for identifying the title of an Internet text page.

Background art

With the development of the Internet, big data collection and mining technologies are developing as well. How to sift the massive data on the Internet, discarding the dross and keeping the essence so as to obtain its valuable content, has therefore become an important issue in big data technology.

On the Internet, valuable information is usually concentrated in the text pages of websites, for example the title, body text, time, and author. The title summarizes and abstracts the content of the text page and carries the most information in the whole web page, so semantic analysis of the title is the most valuable. It is therefore first necessary to extract the title region and the title content from the complete text page. A person looking at a text page can easily pick out the article title. However, because the volume of Internet data is huge, manual information extraction is costly, and its efficiency and accuracy are limited.

Summary of the invention

The technical problem to be solved by the present invention is to provide a vision-based method for identifying the title of an Internet text page.

To solve the technical problem existing in the known art, the present invention adopts the following technical scheme:

In the vision-based Internet text page title identification method of the present invention, the features of title elements and their corresponding weights are summarized and divided, by feature type, into necessity weights and proportion weights. The DOM tree object obtained after the text page is downloaded and rendered by the Chrome browser kernel serves as the input of the identification method. Each element in the DOM tree is traversed using a preset mechanism. For each HTML element, it is first judged whether the necessity-weight conditions are satisfied; if they are, a score is calculated for the HTML element according to the proportion weights. Finally, the HTML element with the highest score overall is taken as the title of the text page.

The present invention may also adopt the following technical measures:

The features of title elements and their corresponding weights are shown in Table 1:

Table 1. Features of title elements and their corresponding weights

The advantages and positive effects of the present invention are:

In the vision-based Internet text page title identification method of the present invention, the input is the DOM tree object obtained after the page is downloaded and rendered by the Chrome kernel. Because the DOM tree contains the rendered style information of the web page elements, it is convenient to analyze. A necessity-weight judgment and a proportion-weight calculation are performed on each HTML element to obtain the title of the text page, and the output is the Element object most likely to be the title. In operation the invention simulates the way a human recognizes a title, and can identify and distinguish title elements in Internet text pages efficiently and accurately.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the vision-based Internet text page title identification method of the present invention.

Embodiment

The present invention is described in detail below by way of a specific embodiment.

As shown in Fig. 1, in the vision-based Internet text page title identification method of the present invention, the features of title elements and their corresponding weights are summarized and divided, by feature type, into necessity weights and proportion weights. The DOM tree object obtained after the text page is downloaded and rendered by the Chrome browser kernel serves as the input of the identification method. Using a preset mechanism, every element is traversed starting from the root of the DOM tree. For each HTML element, that is, each node in the DOM tree, it is first judged whether the necessity-weight conditions are satisfied. A node that does not satisfy them is ruled out as the title, and no proportion-weight calculation is performed for it. If the conditions are satisfied, a score is calculated for the HTML element according to the proportion weights and taken as the score of the current node. The score of the current node is then compared with the highest score calculated for the preceding HTML elements: if it is higher, the current node's score is kept as the new highest score; if it is lower, the previous highest score is kept. The pointer then moves to the next node and the calculation is repeated for the next HTML element. In this way, the HTML element node with the highest score overall is finally obtained and taken as the title of the text page.
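The traversal described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Node` class, the single necessity check, the two proportion features, and all weight values are hypothetical placeholders, since the actual features and weights are those listed in Table 1. The depth-first visiting order is likewise an assumption; the text only states that every element is traversed.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for one rendered DOM element."""
    tag: str
    text: str = ""
    font_size: float = 0.0  # rendered font size in px (assumed feature)
    children: List["Node"] = field(default_factory=list)


# Assumed necessity conditions: failing any one rules the node out as a title.
NECESSITY_CHECKS: List[Callable[[Node], bool]] = [
    lambda n: bool(n.text.strip()),  # the node must carry some text
]

# Assumed proportion features: (weight, function returning a value in 0..1).
PROPORTION_FEATURES: List[Tuple[float, Callable[[Node], float]]] = [
    (0.7, lambda n: min(n.font_size / 32.0, 1.0)),       # larger text scores higher
    (0.3, lambda n: 1.0 if len(n.text) <= 60 else 0.0),  # short text scores higher
]


def score(node: Node) -> Optional[float]:
    """Return the proportion-weight score, or None if a necessity check fails."""
    if not all(check(node) for check in NECESSITY_CHECKS):
        return None  # excluded: no proportion-weight calculation is done
    return sum(weight * feature(node) for weight, feature in PROPORTION_FEATURES)


def find_title(root: Node) -> Optional[Node]:
    """Visit every node and keep the one with the highest score seen so far."""
    best_node: Optional[Node] = None
    best_score = float("-inf")
    stack = [root]
    while stack:
        node = stack.pop()
        current = score(node)
        if current is not None and current > best_score:
            best_node, best_score = node, current  # new highest score
        stack.extend(reversed(node.children))      # move on to the next nodes
    return best_node
```

Keeping only a running best node and score means the tree is walked once and no per-node scores need to be stored.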

Table 1. Features of title elements and their corresponding weights

When the elements of the DOM tree are traversed, the parent element of each element can be obtained. If the parent element of an element is an H1 or H3 tag, the corresponding weight is assigned. H1, H2, H3, and H4 are heading tags in HTML: H1 marks the most important title, H2 a secondary section title or subtitle, H3 a third-level section or category subheading, and H4 a category subheading within the text.
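The parent-tag feature can be illustrated with a short sketch. The `Element` class and the weight values below are hypothetical; the actual weights are those given in Table 1.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical weights for illustration only; the real values belong to Table 1.
HEADING_PARENT_WEIGHTS = {"h1": 1.0, "h3": 0.5}


@dataclass
class Element:
    """Hypothetical DOM element that knows its parent."""
    tag: str
    parent: Optional["Element"] = None


def parent_heading_weight(element: Element) -> float:
    """Return the weight granted when the parent is an H1 or H3 tag, else 0."""
    parent = element.parent
    if parent is None:
        return 0.0  # the root element has no parent to inspect
    return HEADING_PARENT_WEIGHTS.get(parent.tag.lower(), 0.0)
```

For example, an element whose parent is `H1` receives the assumed weight of 1.0, while one whose parent is a `div` receives 0.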

The pattern-recognition flow and feature dimensions proposed by the present invention can also be extended to the identification of other elements of a text page, such as the time, the author, and the body-text region.

The above is only a preferred embodiment of the present invention and does not limit the present invention in any form. Although the present invention has been disclosed above by way of a preferred embodiment, that embodiment is not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the disclosed technical content to make minor changes or modifications that amount to equivalent embodiments. Any simple modification, equivalent change, or refinement made to the above embodiment according to the technical essence of the present invention, without departing from the content of its technical solution, falls within the scope of the technical solution of the present invention.

Claims

1. A vision-based method for identifying the title of an Internet text page, characterized in that: the features of title elements and their corresponding weights are summarized and divided, by feature type, into necessity weights and proportion weights; the DOM tree object obtained after the text page is downloaded and rendered by the Chrome browser kernel serves as the input of the identification method; each element in the DOM tree is traversed using a preset mechanism; for each HTML element, it is first judged whether the necessity-weight conditions are satisfied, and if they are, a score is calculated for the HTML element according to the proportion weights; finally, the HTML element with the highest score overall is taken as the title of the text page.

2. The vision-based method for identifying the title of an Internet text page according to claim 1, characterized in that: the features of title elements and their corresponding weights are shown in Table 1:

Table 1. Features of title elements and their corresponding weights