CN105512107A - Internet regular text page title identification method based on vision - Google Patents

Publication number
CN105512107A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510918241.9A
Other languages
Chinese (zh)
Inventor
李天与
杨伟锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TIANJIN HYLANDA INFORMATION TECHNOLOGY CO LTD
Original Assignee
TIANJIN HYLANDA INFORMATION TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TIANJIN HYLANDA INFORMATION TECHNOLOGY CO LTD
Priority to CN201510918241.9A
Publication of CN105512107A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/258: Heading extraction; Automatic titling; Numbering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a vision-based method for identifying the title of an internet regular text page. The input to the method is a DOM (Document Object Model) tree object that has been downloaded and rendered by a Chrome kernel. Because the DOM tree contains the rendered style information of the web page elements, it is convenient to analyze. Each HTML (HyperText Markup Language) element undergoes a necessity weight judgment and a proportion weight calculation to find the title of the regular text page, and the Element object most likely to be the title is output. In operation the method simulates the way a human recognizes a title, so title elements in internet regular text pages can be identified and distinguished efficiently and accurately.

Description

Vision-based internet text page title identification method
Technical field
The present invention relates to the technical field of internet information acquisition, and specifically to a vision-based method for identifying the title of an internet text page.
Background art
With the development of the internet, big data acquisition and mining technologies are also developing. How to separate the valuable content from the mass of data on the internet has therefore become an important point in big data technology.
On the internet, valuable information is usually concentrated in the text pages of websites, for example the title, body text, time, and author. The title summarizes and abstracts the content of the text page and carries the largest amount of information in the whole web page, so semantic analysis of the title is the most valuable. The first step is therefore to extract the title area and the title content from the complete text page. A person looking at a text page can easily pick out the article title. However, because the volume of internet data is huge, manual information extraction is costly, and its efficiency and accuracy are limited.
Summary of the invention
The technical problem to be solved by the present invention is to provide a vision-based internet text page title identification method.
The technical solution adopted by the present invention to solve the technical problem existing in the known art is as follows:
In the vision-based internet text page title identification method of the present invention, the features of the title element and their corresponding weights are summarized and divided, according to feature type, into necessity weights and proportion weights. The DOM tree object obtained after the text page is downloaded and rendered by the Chrome browser kernel is used as the input of the identification method. Each element in the DOM tree is traversed by means of a pre-reset mechanism. For each HTML element, it is first judged whether the necessity weight conditions are satisfied; if they are, a score is calculated for the element according to the proportion weights. Finally, the HTML element with the highest score overall is taken as the title of the text page.
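The two-stage judgment for a single element (a necessity weight check followed by a proportion weight score) can be sketched as follows. This is only an illustrative sketch: the contents of Table 1 are not reproduced in this text, so the `RenderedElement` fields, the necessity conditions, and the weight values below are all hypothetical assumptions rather than the features claimed by the invention.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class RenderedElement:
    """A rendered element reduced to a few style facts (hypothetical fields)."""
    tag: str
    text: str
    font_size_px: float
    is_bold: bool
    visible: bool


# Necessity conditions (assumed): failing any one rules the element out.
NECESSITY_CHECKS: List[Callable[[RenderedElement], bool]] = [
    lambda e: e.visible,
    lambda e: 0 < len(e.text.strip()) <= 100,
]

# Proportion weights (assumed): each satisfied predicate adds its weight.
PROPORTION_WEIGHTS: List[Tuple[Callable[[RenderedElement], bool], float]] = [
    (lambda e: e.font_size_px >= 24, 3.0),
    (lambda e: e.is_bold, 1.0),
]


def score_element(e: RenderedElement) -> Optional[float]:
    """Return None if a necessity condition fails, else the weighted score."""
    if not all(check(e) for check in NECESSITY_CHECKS):
        return None  # excluded as a title; no proportion score is computed
    return float(sum(w for pred, w in PROPORTION_WEIGHTS if pred(e)))
```

Returning `None` rather than a score of zero keeps "ruled out" distinct from "eligible but unremarkable", which matches the description of skipping the proportion calculation for elements that fail the necessity conditions.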
The present invention can also adopt the following technical measures:
The features of the title element and their corresponding weights are shown in Table 1:
Table 1. Features of the title element and their corresponding weights
The advantages and positive effects of the present invention are as follows:
In the vision-based internet text page title identification method of the present invention, the input is the DOM tree object downloaded and rendered by the Chrome kernel. Because the DOM tree includes the rendered style information of the web page elements, it is convenient to analyze. A necessity weight judgment and a proportion weight calculation are carried out for each HTML element to find the title of the text page, and the output is the Element object most likely to be the title. In operation the present invention simulates the way a human recognizes a title, and can efficiently and accurately identify and distinguish title elements in internet text pages.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the vision-based internet text page title identification method of the present invention.
Embodiment
The present invention is described in detail below by way of a specific embodiment.
As shown in Fig. 1, in the vision-based internet text page title identification method of the present invention, the features of the title element and their corresponding weights are summarized and divided, according to feature type, into necessity weights and proportion weights. The DOM tree object obtained after the text page is downloaded and rendered by the Chrome browser kernel is used as the input of the identification method. Each element in the DOM tree is traversed, starting from the root, by means of a pre-reset mechanism. For each HTML element, that is, each node in the DOM tree, it is first judged whether the necessity weight conditions are satisfied. A node that does not satisfy them is ruled out as the title, and no proportion weight calculation is carried out for it. If the conditions are satisfied, a score is calculated for the HTML element according to the proportion weights and taken as the score of the current node. The current node's score is compared with the highest score calculated for the preceding HTML elements: if it is higher, the current node's score is retained as the new highest score; if it is lower, the former highest score is retained. The pointer then moves to the next node and the calculation is carried out for the next HTML element. In this way the HTML element node with the highest score overall is finally obtained and taken as the title of the text page.
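The traversal with a running highest score described above might look like the following sketch. It assumes a pre-order (parent before children) visiting order, which the text does not state explicitly, and it takes the scoring function as a parameter; `Node`, `find_best`, and the convention that `None` means "necessity conditions not met" are hypothetical names and choices made for illustration. The text does not say what happens on a tie, so this sketch keeps the earlier node.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    """Minimal stand-in for a DOM node (hypothetical structure)."""
    name: str
    children: List["Node"] = field(default_factory=list)


def find_best(
    root: Node,
    score: Callable[[Node], Optional[float]],
) -> Tuple[Optional[Node], Optional[float]]:
    """Visit every node once and keep the highest-scoring eligible node."""
    best_node: Optional[Node] = None
    best_score: Optional[float] = None
    stack = [root]
    while stack:
        node = stack.pop()
        s = score(node)  # None: necessity conditions failed, skip this node
        if s is not None and (best_score is None or s > best_score):
            best_node, best_score = node, s  # strictly higher replaces best
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return best_node, best_score
```

For example, `find_best(root, lambda n: scores.get(n.name))` with a dictionary of precomputed scores returns the node whose score is highest, or `(None, None)` when no node passes the necessity conditions.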
The features of the title element and their corresponding weights are shown in Table 1:
Table 1. Features of the title element and their corresponding weights
When the elements of the DOM tree are traversed, the parent element of each element can be obtained. If the parent element of an element is an H1 or H3 tag, the corresponding weight is given. H1, H2, H3, and H4 are heading tags in HTML: H1 denotes the most important title, H2 a secondary column title or subtitle, H3 a third-level column or category subheading, and H4 a category subheading within the article.
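The parent-tag feature can be illustrated with a small lookup. This is a sketch under assumptions: the `Element` class and the weight values are hypothetical, since the actual weights belong to Table 1, which is not reproduced in this text. Only H1 and H3 parents receive a weight here because those are the two tags the description names.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Assumed example weights for a parent heading tag; not the patent's values.
PARENT_HEADING_WEIGHTS: Dict[str, float] = {"h1": 3.0, "h3": 1.0}


@dataclass
class Element:
    """Minimal element with a link to its parent (hypothetical structure)."""
    tag: str
    parent: Optional["Element"] = None


def parent_heading_weight(el: Element) -> float:
    """Return the weight earned from the parent's tag, or 0.0 if none applies."""
    if el.parent is None:
        return 0.0  # the root element has no parent to inspect
    return PARENT_HEADING_WEIGHTS.get(el.parent.tag.lower(), 0.0)
```

Tag names are lowercased before the lookup because DOM APIs commonly report HTML tag names in upper case.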
The pattern-recognition flow and the feature dimensions proposed by the present invention can also be extended to the identification of other elements of a text page, such as the time, the author, and the body text region.
The above is only a preferred embodiment of the present invention and does not limit the present invention in any form. Although the present invention is disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the present invention. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make minor changes or modifications that yield equivalent embodiments. Any simple modification, equivalent change, or refinement made to the above embodiment according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, falls within the scope of the technical solution of the present invention.

Claims (2)

1. A vision-based internet text page title identification method, characterized in that: the features of the title element and their corresponding weights are summarized and divided, according to feature type, into necessity weights and proportion weights; the DOM tree object obtained after the text page is downloaded and rendered by the Chrome browser kernel is used as the input of the identification method; each element in the DOM tree is traversed by means of a pre-reset mechanism; for each HTML element, it is first judged whether the necessity weight conditions are satisfied, and if they are, a score is calculated for the HTML element according to the proportion weights; and finally the HTML element with the highest score overall is taken as the title of the text page.
2. The vision-based internet text page title identification method according to claim 1, characterized in that: the features of the title element and their corresponding weights are shown in Table 1:
Table 1. Features of the title element and their corresponding weights
CN201510918241.9A 2015-12-10 2015-12-10 Internet regular text page title identification method based on vision Pending CN105512107A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510918241.9A CN105512107A (en) 2015-12-10 2015-12-10 Internet regular text page title identification method based on vision

Publications (1)

Publication Number Publication Date
CN105512107A true CN105512107A (en) 2016-04-20

Family

ID=55720100

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510918241.9A Pending CN105512107A (en) 2015-12-10 2015-12-10 Internet regular text page title identification method based on vision

Country Status (1)

Country Link
CN (1) CN105512107A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050050086A1 (en) * 2003-08-08 2005-03-03 Fujitsu Limited Apparatus and method for multimedia object retrieval
CN102722489A (en) * 2011-03-30 2012-10-10 株式会社理光 System and method for extracting object identifier from webpage
CN102737122A (en) * 2012-06-08 2012-10-17 浙江大学 Method for extracting verification code image from webpage
CN103942211A (en) * 2013-01-21 2014-07-23 腾讯科技(深圳)有限公司 Text page recognition method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 300020 Tianjin Heping District, South Road, No. 11 International Building 23 purchase of Wheat

Applicant after: Tianjin mass information technology Limited by Share Ltd

Address before: 300020 Tianjin Heping District, South Road, No. 11 International Building 23 purchase of Wheat

Applicant before: Tianjin Hylanda Information Technology Co.,Ltd.

COR Change of bibliographic data
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160420