CN106227882A - Accessible web page navigation method based on navigation object extraction - Google Patents

Accessible web page navigation method based on navigation object extraction

Info

Publication number
CN106227882A
Authority
CN
China
Prior art keywords
hyperlink
group
navigation
navigation object
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610635259.2A
Other languages
Chinese (zh)
Other versions
CN106227882B (en)
Inventor
王灿
钊魁
卜佳俊
陈纯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201610635259.2A
Publication of CN106227882A
Application granted
Publication of CN106227882B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

An accessible web page navigation method based on navigation object extraction. After web pages are crawled from the Internet, each page is processed as follows. First, the page is parsed into a Document Object Model (DOM) tree. The DOM tree is then traversed depth-first, and its nodes are numbered sequentially with natural numbers. The distance between two hyperlinks is defined as the absolute difference of their node numbers, and hyperlinks that are close to each other are clustered into one group. For each hyperlink group, the number of hyperlinks and the mean and variance of the hyperlink text length are computed; these three values serve as the features that formally represent the group. Finally, the formally represented hyperlink groups are classified into navigation objects and non-navigation objects. The advantage of the method is that it automatically extracts the navigation objects in a web page, helps users navigate quickly between pages, and improves the user experience.

Description

Accessible web page navigation method based on navigation object extraction
Technical field
The present invention relates to the technical field of accessible web page navigation, and in particular to an accessible web page navigation method based on navigation object extraction.
Background
There are roughly 30 million blind people in the world, of whom about 5 million are in China, accounting for 18% of the world total. As the Internet becomes widespread and ever more important in daily life, Internet access for blind people will become a major issue in accessibility work. Because blind people cannot receive information visually, their difficulty in going online is especially pronounced. Meanwhile, website content is growing richer and website structure more complex. Large Internet companies in particular, such as the portals Sina and Sohu, host millions of pages. Faced with websites of this size, providing blind users with an accessible web page navigation method is particularly important.
To make browsing convenient, many web pages provide navigation objects such as navigation bars and navigation lists, and users rely on them to reach the pages they want. Blind users, however, cannot receive information visually and therefore cannot use navigation objects to reach those pages. For blind users who depend on assistive tools such as screen readers, extracting the navigation objects in a page helps them move quickly between pages and improves browsing efficiency.
At present, in fields such as machine learning, research on feature extraction and on the classification algorithms built on it is maturing. In web page extraction, statistical information such as average text length is widely used to represent the content to be extracted, so that the content can be expressed as a computable vector. Once the content has been formalized in this way, existing machine-learning classification algorithms such as SVM can assign it to different classes.
Summary of the invention
To overcome the above shortcomings of the prior art, the present invention proposes an accessible web page navigation method based on navigation object extraction, which automatically extracts the navigation objects in a web page, helps users navigate quickly between pages, and improves the user experience.
The accessible web page navigation method based on navigation object extraction of the present invention comprises the following steps:
1. After web pages are crawled from the Internet, the following operations are performed on each page:
1) Parse the page into a Document Object Model (DOM) tree, traverse the DOM tree depth-first, and number its nodes sequentially with natural numbers;
2) Represent the distance between two hyperlinks by the absolute difference of their node numbers, and cluster hyperlinks that are near each other into one group;
3) For each hyperlink group $C_i$, compute the number of hyperlinks $N_i$ and the mean $\mu_i$ and variance $\sigma_i^2$ of the hyperlink text length, and formally represent $C_i$ as the vector $P_i = (N_i, \mu_i, \sigma_i^2)$;
4) Use a classification algorithm to classify the formally represented hyperlink groups into navigation objects and non-navigation objects.
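Steps 1) and 2) can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that only element nodes are numbered (the text does not say whether text nodes count), it uses Python's standard `html.parser`, and the names `LinkNumberer`, `number_links` and `adjacent_distances` are hypothetical. Start tags are reported in document order, which is the pre-order depth-first visiting order of the element tree.

```python
from html.parser import HTMLParser


class LinkNumberer(HTMLParser):
    """Numbers element nodes in depth-first order and records hyperlink numbers."""

    def __init__(self):
        super().__init__()
        self.counter = 0   # last natural number assigned to a node
        self.links = []    # (node_number, href) for every <a href=...>

    def handle_starttag(self, tag, attrs):
        # Each start tag is one element node; document order equals
        # pre-order depth-first order (assumption: text nodes are skipped).
        self.counter += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append((self.counter, href))


def number_links(html):
    """Return [(node_number, href), ...] for the hyperlinks in an HTML string."""
    parser = LinkNumberer()
    parser.feed(html)
    parser.close()
    return parser.links


def adjacent_distances(links):
    """Absolute difference of node numbers between consecutive hyperlinks."""
    numbers = [number for number, _ in links]
    return [abs(b - a) for a, b in zip(numbers, numbers[1:])]
```

Hyperlinks inside the same list or menu tend to receive nearby numbers, so small distances suggest that two links belong to the same visual block.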
The hyperlink clustering of step 2) is specifically: compute the distances between all adjacent hyperlinks and sort them in descending order to obtain $D = [d_1, d_2, \ldots, d_l]$; choose the index $i$ from 1 to $l$ that minimizes a selection criterion with adjustable parameter $\beta$ (the expression itself is missing from this text); the chosen $d_i$ is the threshold $t$ that decides whether two hyperlinks are near or far: if the distance between two hyperlinks is not greater than $t$, the two hyperlinks are near; otherwise they are far.
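The grouping rule can be illustrated with a short sketch. Because the criterion used to select the threshold is not reproduced in this text, the sketch takes the threshold `t` as an input rather than deriving it; the function names are hypothetical. It assumes that hyperlinks are compared in order of their node numbers, so that "adjacent" means consecutive in that order.

```python
def sorted_gaps(numbers):
    """Distances between adjacent hyperlinks, sorted in descending order (the list D)."""
    ordered = sorted(numbers)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return sorted(gaps, reverse=True)


def group_by_threshold(numbers, t):
    """Cluster hyperlink node numbers: adjacent links whose distance is <= t are 'near'."""
    groups = []
    for number in sorted(numbers):
        if groups and number - groups[-1][-1] <= t:
            groups[-1].append(number)   # near the previous link: same group
        else:
            groups.append([number])     # far from the previous link: new group
    return groups
```

Any entry of the list returned by `sorted_gaps` is a candidate threshold; a larger `t` merges more links into each group.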
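The formal representation of hyperlink group $C_i$ in step 3) is:
31) the hyperlink text length mean $\mu_i = \frac{\mathrm{total}(words)}{N_i}$, where $\mathrm{total}(words)$ is the number of words in the text of all hyperlinks in $C_i$;
32) the hyperlink text length variance $\sigma_i^2 = \frac{1}{N_i}\sum_{j=1}^{N_i}\left(\mathrm{total}(u_{ij}) - \mu_i\right)^2$, where $\mathrm{total}(u_{ij})$ is the number of words in the text of hyperlink $u_{ij}$ in $C_i$.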
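The three features can be computed as in the sketch below. Two points are assumptions rather than statements of the source: words are counted by splitting on whitespace (Chinese link text would need character counting or word segmentation instead), and the variance is the population variance, divided by the number of links. The function name is hypothetical.

```python
def link_group_features(link_texts):
    """Return (N, mean, variance) of word counts for one hyperlink group."""
    if not link_texts:
        raise ValueError("a hyperlink group must contain at least one link")
    # Assumption: a "word" is a whitespace-separated token.
    counts = [len(text.split()) for text in link_texts]
    n = len(counts)
    mean = sum(counts) / n                                   # total(words) / N
    variance = sum((c - mean) ** 2 for c in counts) / n      # population variance
    return (n, mean, variance)
```

A navigation bar typically yields many links with short, uniform text, so a large count with a small mean and variance is the pattern the classifier is expected to pick up.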
The accessible web page navigation method based on navigation object extraction proposed by the present invention has the following advantages: navigation objects are extracted from web pages automatically, helping users navigate quickly between pages; the method applies to all types of web pages and requires no manual back-end operation; and it can be used to help blind users achieve accessible web page navigation.
Brief description of the drawings
Fig. 1 is a flowchart of the method of the present invention.
Detailed description
The present invention is further described below with reference to the drawing:
An accessible web browsing method based on link clustering, comprising the following steps:
1. After web pages are crawled from the Internet, the following operations are performed on each page:
1) Parse the page into a Document Object Model (DOM) tree, traverse the DOM tree depth-first, and number its nodes sequentially with natural numbers;
2) Represent the distance between two hyperlinks by the absolute difference of their node numbers, and cluster hyperlinks that are near each other into one group;
3) For each hyperlink group $C_i$, compute the number of hyperlinks $N_i$ and the mean $\mu_i$ and variance $\sigma_i^2$ of the hyperlink text length, and formally represent $C_i$ as the vector $P_i = (N_i, \mu_i, \sigma_i^2)$;
4) Use the SVM classification algorithm to classify the formally represented hyperlink groups into navigation objects and non-navigation objects.
The hyperlink clustering of step 2) is specifically: compute the distances between all adjacent hyperlinks and sort them in descending order to obtain $D = [d_1, d_2, \ldots, d_l]$; choose the index $i$ from 1 to $l$ that minimizes a selection criterion with adjustable parameter $\beta$ (the expression itself is missing from this text); the chosen $d_i$ is the threshold $t$ that decides whether two hyperlinks are near or far: if the distance between two hyperlinks is not greater than $t$, the two hyperlinks are near; otherwise they are far.
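The formal representation of hyperlink group $C_i$ in step 3) is:
31) the hyperlink text length mean $\mu_i = \frac{\mathrm{total}(words)}{N_i}$, where $\mathrm{total}(words)$ is the number of words in the text of all hyperlinks in $C_i$;
32) the hyperlink text length variance $\sigma_i^2 = \frac{1}{N_i}\sum_{j=1}^{N_i}\left(\mathrm{total}(u_{ij}) - \mu_i\right)^2$, where $\mathrm{total}(u_{ij})$ is the number of words in the text of hyperlink $u_{ij}$ in $C_i$.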
The SVM classification algorithm of step 4) is specifically:
41) Collect a number of web pages and manually label the navigation objects in them; the labeled pages constitute the training set;
42) Let $P = [P_1, P_2, \ldots, P_N]$ and $Y = [Y_1, Y_2, \ldots, Y_N]$, where $P_i$ is the formal representation of hyperlink group $C_i$ in the training set; $Y_i = 1$ if $C_i$ is labeled as a navigation object, and $Y_i = 0$ otherwise;
43) Run the SVM learning algorithm on $P$ and $Y$ to generate a classification model;
44) Run the classification model on the formal representation $P_k$ of a hyperlink group $C_k$ to be classified to obtain the output $Y_k$; if $Y_k = 1$, hyperlink group $C_k$ is a navigation object, and if $Y_k = 0$, it is not.
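Steps 41) to 44) can be illustrated with a dependency-free sketch. The source only names SVM; the particular trainer here is an assumption: a linear classifier fitted by subgradient descent on the L2-regularized hinge loss (the linear soft-margin SVM objective), with a simple early stop once every training point has margin at least 1. For brevity the bias is regularized together with the weights. The function names and the default values of `lr`, `lam` and `max_epochs` are hypothetical. In practice an established SVM library would normally be used instead.

```python
def train_linear_svm(P, Y, lr=0.01, lam=0.01, max_epochs=200):
    """Fit weights (last entry is the bias) on feature vectors P and 0/1 labels Y."""
    dim = len(P[0])
    w = [0.0] * (dim + 1)
    for _ in range(max_epochs):
        updated = False
        for x, label in zip(P, Y):
            y = 1.0 if label == 1 else -1.0      # map Y in {0, 1} to {-1, +1}
            xa = list(x) + [1.0]                 # augmented vector carries the bias
            margin = y * sum(wi * xi for wi, xi in zip(w, xa))
            # Gradient step for the L2 regularizer: shrink the weights.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1.0:
                # Hinge loss is active: move the hyperplane toward this point.
                w = [wi + lr * y * xi for wi, xi in zip(w, xa)]
                updated = True
        if not updated:
            break                                # every point had margin >= 1
    return w


def predict(w, x):
    """Return 1 (navigation object) or 0 (non-navigation object)."""
    score = sum(wi * xi for wi, xi in zip(w, list(x) + [1.0]))
    return 1 if score > 0 else 0
```

Each training vector has the form (number of links, mean text length, variance of text length), and `predict` applied to the vector of an unlabeled hyperlink group plays the role of the output $Y_k$ in step 44). Because the three features live on different scales, standardizing them before training is a reasonable additional step that the text does not mention.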
The content described in the embodiments of this specification merely enumerates forms in which the inventive concept may be realized. The scope of protection of the present invention should not be regarded as limited to the specific forms stated in the embodiments; it also extends to equivalent technical means that those skilled in the art can conceive of in accordance with the inventive concept.

Claims (4)

1. An accessible web page navigation method based on navigation object extraction, characterized in that, after web pages are crawled from the Internet, the following operations are performed on each page:
1) parse the page into a Document Object Model (DOM) tree, traverse the DOM tree depth-first, and number its nodes sequentially with natural numbers;
2) represent the distance between two hyperlinks by the absolute difference of their node numbers, and cluster hyperlinks that are near each other into one group;
3) for each hyperlink group $C_i$, compute the number of hyperlinks $N_i$ and the mean $\mu_i$ and variance $\sigma_i^2$ of the hyperlink text length, and formally represent $C_i$ as the vector $P_i = (N_i, \mu_i, \sigma_i^2)$;
4) use a classification algorithm to classify the formally represented hyperlink groups into navigation objects and non-navigation objects.
2. The accessible web page navigation method based on navigation object extraction as claimed in claim 1, characterized in that the hyperlink clustering of step 2) is specifically: compute the distances between all adjacent hyperlinks and sort them in descending order to obtain $D = [d_1, d_2, \ldots, d_l]$; choose the index $i$ from 1 to $l$ that minimizes a selection criterion with adjustable parameter $\beta$ (the expression itself is missing from this text); the chosen $d_i$ is the threshold $t$ that decides whether two hyperlinks are near or far: if the distance between two hyperlinks is not greater than $t$, the two hyperlinks are near; otherwise they are far.
3. The accessible web page navigation method based on navigation object extraction as claimed in claim 1, characterized in that the formal representation of hyperlink group $C_i$ in step 3) is:
31) the hyperlink text length mean $\mu_i = \frac{\mathrm{total}(words)}{N_i}$, where $\mathrm{total}(words)$ is the number of words in the text of all hyperlinks in $C_i$;
32) the hyperlink text length variance $\sigma_i^2 = \frac{1}{N_i}\sum_{j=1}^{N_i}\left(\mathrm{total}(u_{ij}) - \mu_i\right)^2$, where $\mathrm{total}(u_{ij})$ is the number of words in the text of hyperlink $u_{ij}$ in $C_i$.
4. The accessible web page navigation method based on navigation object extraction as claimed in claim 1, characterized in that the SVM classification algorithm of step 4) is specifically:
41) collect a number of web pages and manually label the navigation objects in them; the labeled pages constitute the training set;
42) let $P = [P_1, P_2, \ldots, P_N]$ and $Y = [Y_1, Y_2, \ldots, Y_N]$, where $P_i$ is the formal representation of hyperlink group $C_i$ in the training set; $Y_i = 1$ if $C_i$ is labeled as a navigation object, and $Y_i = 0$ otherwise;
43) run the SVM learning algorithm on $P$ and $Y$ to generate a classification model;
44) run the classification model on the formal representation $P_k$ of a hyperlink group $C_k$ to be classified to obtain the output $Y_k$; if $Y_k = 1$, hyperlink group $C_k$ is a navigation object, and if $Y_k = 0$, it is not.
CN201610635259.2A 2016-08-02 2016-08-02 Accessible web page navigation method based on navigation object extraction Active CN106227882B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610635259.2A CN106227882B (en) 2016-08-02 2016-08-02 Accessible web page navigation method based on navigation object extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610635259.2A CN106227882B (en) 2016-08-02 2016-08-02 Accessible web page navigation method based on navigation object extraction

Publications (2)

Publication Number Publication Date
CN106227882A true CN106227882A (en) 2016-12-14
CN106227882B CN106227882B (en) 2019-08-23

Family

ID=57546904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610635259.2A Active CN106227882B (en) 2016-08-02 2016-08-02 Accessible web page navigation method based on navigation object extraction

Country Status (1)

Country Link
CN (1) CN106227882B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108733405A (en) * 2017-04-13 2018-11-02 富士通株式会社 The method and apparatus that training webpage distribution indicates model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101246494A (en) * 2008-03-19 2008-08-20 腾讯科技(深圳)有限公司 Internet web page conversion method, system and equipment
US20100003645A1 (en) * 2008-07-02 2010-01-07 Moresteam.Com Llc Education method and tool
CN102799638A (en) * 2012-06-25 2012-11-28 浙江大学 In-page navigation generation method facing barrier-free access to webpage contents

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张兵等: "基于超链接和DOM结构树的网页标题实时抽取方法", 《计算机与现代化》 *

Also Published As

Publication number Publication date
CN106227882B (en) 2019-08-23

Similar Documents

Publication Publication Date Title
CN103455613B (en) Based on the interest aware service recommendation method of MapReduce model
CN101388022B (en) Web portrait search method for fusing text semantic and vision content
CN103226578B (en) Towards the website identification of medical domain and the method for webpage disaggregated classification
CN104199822B (en) It is a kind of to identify the method and system for searching for corresponding demand classification
CN107463658B (en) Text classification method and device
CN103577466B (en) Method and device for displaying webpage content in browser
CN104598577B (en) A kind of extracting method of Web page text
CN101727498A (en) Automatic extraction method of web page information based on WEB structure
CN103136358B (en) A kind of method of Automatic Extraction forum data
CN103942340A (en) Microblog user interest recognizing method based on text mining
CN104408093A (en) News event element extracting method and device
CN102693304B (en) Search engine feedback information processing method and search engine
CN103077164A (en) Text analysis method and text analyzer
CN102169496A (en) Anchor text analysis-based automatic domain term generating method
CN101777060A (en) Automatic evaluation method and system of webpage visual quality
CN105512333A (en) Product comment theme searching method based on emotional tendency
CN105573979B (en) A kind of wrongly written character word knowledge generation method that collection is obscured based on Chinese character
CN101515272A (en) Method and device for extracting webpage content
CN109543126A (en) Web page text information extracting method based on block text accounting
CN106156372A (en) The sorting technique of a kind of internet site and device
CN107402916A (en) The segmenting method and device of Chinese text
CN108874996A (en) website classification method and device
CN104598648B (en) A kind of microblog users interactive mode gender identification method and device
CN105183715A (en) Word distribution and document feature based automatic classification method for spam comments
CN102799638B (en) In-page navigation generation method facing barrier-free access to webpage contents

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant