CN105512107A - Internet regular text page title identification method based on vision - Google Patents
Internet regular text page title identification method based on vision
- Publication number
- CN105512107A (application number CN201510918241.9A)
- Authority
- CN
- China
- Prior art keywords
- text page
- title
- internet
- identification method
- regular text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/258—Heading extraction; Automatic titling; Numbering
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Document Processing Apparatus (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a vision-based method for identifying the title of an internet regular text page. The input to the method is the DOM (Document Object Model) tree object of a page that has been downloaded and rendered by the Chrome kernel. Because the DOM tree contains the rendered style information of the webpage elements, it is convenient to analyze. Each HTML (HyperText Markup Language) element undergoes a necessity-weight judgment and a proportion-weight calculation, from which the title of the regular text page is obtained; the output is the Element object most likely to be the title of the page. In operation the method simulates the way a human recognizes a title, so title elements in internet regular text pages can be identified and distinguished efficiently and accurately.
Description
Technical field
The present invention relates to the technical field of internet information acquisition, and specifically to a vision-based method for identifying the title of an internet regular text page.
Background
With the development of the internet, big data collection and mining technologies are developing as well. How to sift the massive data on the internet, discarding the dross and keeping the essence so as to obtain the valuable content, has therefore become an important topic in big data technology.
On the internet, valuable information is usually concentrated in the regular text pages of websites, for example the title, body text, time and author. The title summarizes and abstracts the content of the text page and carries the most information in the whole webpage, so semantic analysis of the title is the most valuable. The title region and title content therefore first need to be extracted from the complete text page. A person looking at a text page can easily pick out the article title. However, because the volume of internet data is huge, manual information extraction is costly, and its efficiency and accuracy are both limited.
Summary of the invention
The technical problem to be solved by the present invention is to provide a vision-based method for identifying the title of an internet regular text page.
The technical solution adopted by the present invention to solve the technical problems in the known art is as follows:
In the vision-based internet regular text page title identification method of the present invention, the features of title elements and their corresponding weights are summarized and divided, according to feature type, into necessity weights and proportion weights. The DOM tree object obtained after the text page is downloaded and rendered by the Chrome browser kernel is used as the input of the identification method. Each element in the DOM tree is traversed by way of a pre-reset mechanism. For each HTML element, it is first judged whether the necessity-weight conditions are satisfied; if they are, a score is calculated for that HTML element according to the proportion weights. Finally, the HTML element with the highest score overall is taken as the title of the text page.
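The two-stage scoring of a single element described above can be sketched as follows. This is only an illustrative sketch: the contents of Table 1 are not reproduced in this text, so the `Element` class, the necessity checks, the proportion features and every numeric weight below are hypothetical placeholders, not the features or weights of the invention.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Element:
    """Hypothetical stand-in for a rendered DOM element."""
    tag: str
    text: str
    # Rendered style values, e.g. {"font_size": 28.0, "bold": 1.0}.
    style: Dict[str, float] = field(default_factory=dict)


# Hypothetical necessity conditions: failing any one excludes the element.
NECESSITY_CHECKS: List[Callable[[Element], bool]] = [
    lambda e: bool(e.text.strip()),                # must contain text
    lambda e: e.style.get("visible", 1.0) > 0.0,   # must be rendered visibly
]

# Hypothetical proportion features: (weight, function returning a value in [0, 1]).
PROPORTION_FEATURES: List[Tuple[float, Callable[[Element], float]]] = [
    (0.6, lambda e: min(e.style.get("font_size", 0.0) / 40.0, 1.0)),
    (0.4, lambda e: 1.0 if e.style.get("bold", 0.0) else 0.0),
]


def score_element(e: Element) -> Optional[float]:
    """Return None if a necessity condition fails, else the weighted score."""
    if not all(check(e) for check in NECESSITY_CHECKS):
        return None  # excluded: no proportion-weight calculation is done
    return sum(weight * feature(e) for weight, feature in PROPORTION_FEATURES)
```

Returning `None` for excluded elements keeps "ruled out as a title" distinct from "scored zero", which mirrors the separation between the necessity-weight judgment and the proportion-weight calculation.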
The present invention may also adopt the following technical measures:
The features of the title element and their corresponding weights are shown in Table 1:
Table 1. Features of the title element and corresponding weights
The advantages and positive effects of the present invention are as follows:
In the vision-based internet regular text page title identification method of the present invention, the input is the DOM tree object obtained after the page is downloaded and rendered by the Chrome kernel. Because the DOM tree contains the rendered style information of the webpage elements, it is convenient to analyze. A necessity-weight judgment and a proportion-weight calculation are carried out for each HTML element to obtain the title of the text page, and the output is the Element object most likely to be the title. In operation the invention simulates the way a human recognizes a title, and can efficiently and accurately identify and distinguish the title elements in internet regular text pages.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the vision-based internet regular text page title identification method of the present invention.
Detailed description of the embodiments
The present invention is described in detail below by way of a specific embodiment.
As shown in Fig. 1, in the vision-based internet regular text page title identification method of the present invention, the features of title elements and their corresponding weights are summarized and divided, according to feature type, into necessity weights and proportion weights. The DOM tree object obtained after the text page is downloaded and rendered by the Chrome browser kernel is used as the input of the identification method. Starting from the root of the DOM tree, each element is traversed by way of a pre-reset mechanism. For each HTML element, that is, each node in the DOM tree, it is first judged whether the necessity-weight conditions are satisfied. A node that does not satisfy them can be ruled out as the title, and no proportion-weight calculation is performed for it. If the conditions are satisfied, a score is calculated for the HTML element according to the proportion weights and taken as the score of the current node. The score of the current node is then compared with the highest score calculated for the preceding HTML elements: if it is higher, the current node's score is retained as the new highest score; if it is lower, the previous highest score is retained. The pointer then moves to the next node and the calculation is carried out for the next HTML element. In this way the HTML element node with the highest score overall is finally obtained and taken as the title of the text page.
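The traversal loop with its running highest score can be sketched as follows. This is a minimal sketch under stated assumptions: the `Node` class is a hypothetical stand-in for a rendered DOM node, the depth-first pre-order walk is an assumed traversal order (the text only says that every element is visited), and `passes_necessity` and `proportion_score` are placeholder rules, since the actual features and weights of Table 1 are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a rendered DOM node."""
    tag: str
    text: str = ""
    font_size: float = 0.0
    children: List["Node"] = field(default_factory=list)


def walk(root: Node) -> Iterator[Node]:
    """Visit every node once, depth-first in document order (assumed order)."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(node.children))


def passes_necessity(node: Node) -> bool:
    """Placeholder necessity condition: the node must carry its own text."""
    return bool(node.text.strip())


def proportion_score(node: Node) -> float:
    """Placeholder proportion score: larger rendered text scores higher."""
    return node.font_size


def find_title(root: Node) -> Optional[Node]:
    """Return the highest-scoring node that passes the necessity check."""
    best_node: Optional[Node] = None
    best_score = float("-inf")
    for node in walk(root):
        if not passes_necessity(node):
            continue  # ruled out as a title; no score is computed
        score = proportion_score(node)
        if score > best_score:  # only a strictly higher score replaces the best
            best_node, best_score = node, score
    return best_node
```

Only the current best node and score are kept in memory, so the whole tree is processed in a single pass without storing a score per element.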
The features of the title element and their corresponding weights are shown in Table 1:
Table 1. Features of the title element and corresponding weights
While the elements of the DOM tree are being traversed, the parent element of each element can be obtained. If the parent element of an element is an H1 or H3 tag, the corresponding weight is assigned. H1, H2, H3 and H4 are heading tags in HTML: H1 denotes the most important title; H2 a secondary column heading or subtitle; H3 a third-level column or category subheading; and H4 a category subheading within the article.
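The parent-tag feature can be illustrated with a short sketch. The `Elem` class and the numeric values in `PARENT_TAG_WEIGHTS` are hypothetical; the text states only that a parent tag of H1 or H3 receives a corresponding weight, and the actual values belong to Table 1, which is not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Elem:
    """Hypothetical stand-in for a DOM element with a link to its parent."""
    tag: str
    parent: Optional["Elem"] = None


# Hypothetical weights for the parent-tag feature (real values are in Table 1).
PARENT_TAG_WEIGHTS: Dict[str, float] = {"h1": 1.0, "h3": 0.5}


def parent_heading_weight(e: Elem) -> float:
    """Return the weight earned from the parent's tag, or 0.0 if none applies."""
    if e.parent is None:
        return 0.0  # the root element has no parent
    return PARENT_TAG_WEIGHTS.get(e.parent.tag.lower(), 0.0)
```

For example, a `span` whose parent is an `H1` element would earn the H1 weight, while the same `span` inside a `div` would earn nothing from this feature.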
The pattern recognition flow and dimensions proposed by the present invention can also be extended to the identification of other elements of the text page, such as the time, the author and the body text region.
The above is only a preferred embodiment of the present invention and does not limit the present invention in any form. Although the present invention is disclosed above by way of a preferred embodiment, that embodiment is not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the disclosed technical content to make minor changes or modifications that amount to equivalent embodiments. Any simple modification, equivalent change or refinement made to the above embodiment according to the technical essence of the present invention, without departing from the content of its technical solution, falls within the scope of the technical solution of the present invention.
Claims (2)
1. A vision-based internet regular text page title identification method, characterized in that: the features of title elements and their corresponding weights are summarized and divided, according to feature type, into necessity weights and proportion weights; the DOM tree object obtained after the text page is downloaded and rendered by the Chrome browser kernel is used as the input of the identification method; each element in the DOM tree is traversed by way of a pre-reset mechanism; for each HTML element, it is first judged whether the necessity-weight conditions are satisfied and, if they are, a score is calculated for that HTML element according to the proportion weights; and finally the HTML element with the highest score overall is taken as the title of the text page.
2. The vision-based internet regular text page title identification method according to claim 1, characterized in that: the features of the title element and their corresponding weights are shown in Table 1:
Table 1. Features of the title element and corresponding weights
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510918241.9A CN105512107A (en) | 2015-12-10 | 2015-12-10 | Internet regular text page title identification method based on vision |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105512107A true CN105512107A (en) | 2016-04-20 |
Family
ID=55720100
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510918241.9A Pending CN105512107A (en) | 2015-12-10 | 2015-12-10 | Internet regular text page title identification method based on vision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105512107A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050050086A1 (en) * | 2003-08-08 | 2005-03-03 | Fujitsu Limited | Apparatus and method for multimedia object retrieval |
CN102722489A (en) * | 2011-03-30 | 2012-10-10 | 株式会社理光 | System and method for extracting object identifier from webpage |
CN102737122A (en) * | 2012-06-08 | 2012-10-17 | 浙江大学 | Method for extracting verification code image from webpage |
CN103942211A (en) * | 2013-01-21 | 2014-07-23 | 腾讯科技(深圳)有限公司 | Text page recognition method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102541874B (en) | Webpage text content extracting method and device | |
CN105630941B (en) | Web body matter abstracting methods based on statistics and structure of web page | |
CN105912625B (en) | A kind of entity classification method and system towards link data | |
CN111783394B (en) | Training method of event extraction model, event extraction method, system and equipment | |
CN102930031B (en) | By the method and system extracting bilingual parallel text in webpage | |
CN104598577B (en) | A kind of extracting method of Web page text | |
CN103336766B (en) | Short text garbage identification and modeling method and device | |
CN103853834B (en) | Text structure analysis-based Web document abstract generation method | |
CN103077190A (en) | Hot event ranking method based on order learning technology | |
CN102270206A (en) | Method and device for capturing valid web page contents | |
CN110110075A (en) | Web page classification method, device and computer readable storage medium | |
CN105975478A (en) | Word vector analysis-based online article belonging event detection method and device | |
CN105956052A (en) | Building method of knowledge map based on vertical field | |
CN103077164A (en) | Text analysis method and text analyzer | |
CN107992542A (en) | A kind of similar article based on topic model recommends method | |
CN106294350A (en) | A kind of text polymerization and device | |
CN104298665A (en) | Identification method and device of evaluation objects of Chinese texts | |
CN106126502A (en) | A kind of emotional semantic classification system and method based on support vector machine | |
CN104424308A (en) | Web page classification standard acquisition method and device and web page classification method and device | |
CN104881458A (en) | Labeling method and device for web page topics | |
CN104199846A (en) | Comment subject term clustering method based on Wikipedia | |
CN106055667A (en) | Method for extracting core content of webpage based on text-tag density | |
CN101968801A (en) | Method for extracting key words of single text | |
CN105512333A (en) | Product comment theme searching method based on emotional tendency | |
CN103810251A (en) | Method and device for extracting text |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 300020 Tianjin Heping District, South Road, No. 11 International Building 23 purchase of Wheat; Applicant after: Tianjin mass information technology Limited by Share Ltd; Address before: 300020 Tianjin Heping District, South Road, No. 11 International Building 23 purchase of Wheat; Applicant before: Tianjin Hylanda Information Technology Co.,Ltd. |
COR | Change of bibliographic data | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20160420 |