CN106227882A - Accessible web page navigation method based on navigation object extraction - Google Patents
- Publication number
- CN106227882A (application CN201610635259.2A)
- Authority
- CN
- China
- Prior art keywords
- hyperlink
- group
- navigation
- navigation object
- webpage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Abstract
An accessible web page navigation method based on navigation object extraction. After web pages are crawled from the Internet, each page is processed as follows: the page is first parsed into a document object model (DOM) tree; the DOM tree is then traversed depth-first and its nodes are numbered sequentially with natural numbers; the distance between two hyperlinks is represented by the absolute difference of their node numbers, and hyperlinks that are close to one another are clustered into one group; for each hyperlink group, the number of hyperlinks and the mean and variance of the hyperlink text length are calculated; each hyperlink group is formally represented using these three features; finally, the formally represented hyperlink groups are classified into navigation objects and non-navigation objects. The advantage of the method is that it automatically extracts the navigation objects in a web page, helping users navigate quickly between pages and improving the user experience.
Description
Technical field
The present invention relates to the technical field of accessible web page navigation, and in particular to an accessible web page navigation method based on navigation object extraction.
Background art
There are about 30 million blind people in the world, and about 5 million of them are in China, accounting for 18% of the world total. As the Internet becomes increasingly widespread and increasingly important in daily life, Internet access for blind people is becoming a major issue in accessibility work. Because blind people cannot receive information visually, their difficulty in getting online is especially acute. At the same time, website content keeps growing richer and website structure more complex; this is especially true of large Internet companies, such as the portals Sina and Sohu, whose sites contain millions of pages. Faced with websites containing such a huge number of content pages, providing an accessible web page navigation method for blind people is particularly important.
To make browsing convenient, many web pages now provide navigation objects such as navigation bars and navigation lists, with whose help users navigate to the pages they want to browse. Blind users, however, cannot receive information visually and therefore cannot use these navigation objects to reach the pages they need. For blind users who rely on assistive tools such as screen readers to browse the web, extracting the navigation objects in a web page helps them navigate quickly between pages and improves browsing efficiency.
At present, in fields such as machine learning, research on feature extraction and on the classification algorithms built on top of it is maturing. In web page extraction, statistical information such as average text length is widely used to represent the content to be extracted, expressing that content as a computable vector. Once the content to be extracted has been formalized in this way, existing machine learning classification algorithms such as SVM can divide it into different classes.
Summary of the invention
To overcome the above shortcomings of the prior art, the present invention proposes an accessible web page navigation method based on navigation object extraction, which automatically extracts the navigation objects in a web page, helps users navigate quickly between pages, and improves the user experience.
The accessible web page navigation method based on navigation object extraction of the present invention comprises the following steps:
1. After web pages are crawled from the Internet, the following operations are performed on each page:
1) Parse the page into a document object model (DOM) tree, traverse the DOM tree depth-first, and number its nodes sequentially with natural numbers;
2) Represent the distance between two hyperlinks by the absolute difference of their node numbers, and cluster hyperlinks that are close to one another into one group;
3) For each hyperlink group $C_i$, calculate the number of hyperlinks $N_i$ and the mean and variance of the hyperlink text length, and formally represent $C_i$ as the vector of these three features;
4) Use a classification algorithm to classify the formally represented hyperlink groups into navigation objects and non-navigation objects.
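As an illustration of step 1), the following minimal sketch numbers nodes in depth-first order using Python's standard `html.parser`. It relies on start tags and text arriving in document order, which corresponds to a pre-order depth-first traversal of the tree. The class name `LinkNumberer`, the function `number_links`, and the choices to count text nodes and to skip whitespace-only text are assumptions made for the example; the patent does not specify them, and a browser-built DOM may number some nodes differently.

```python
from html.parser import HTMLParser


class LinkNumberer(HTMLParser):
    """Numbers nodes in document (depth-first) order and records hyperlinks."""

    def __init__(self):
        super().__init__()
        self.counter = 0    # last natural number assigned to a node
        self.links = []     # [node_number, anchor_text] for each <a href=...>
        self._open = None   # the link whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        self.counter += 1   # every element node gets the next number
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self._open = [self.counter, ""]
            self.links.append(self._open)

    def handle_endtag(self, tag):
        if tag == "a":
            self._open = None

    def handle_data(self, data):
        if not data.strip():
            return          # assumption: whitespace-only text is not numbered
        self.counter += 1   # assumption: text nodes are numbered as well
        if self._open is not None:
            self._open[1] += data.strip()


def number_links(html):
    """Return (node_number, anchor_text) pairs for all hyperlinks in the page."""
    parser = LinkNumberer()
    parser.feed(html)
    parser.close()
    return [(number, text) for number, text in parser.links]
```

The node numbers returned here are the values whose absolute differences serve as hyperlink distances in step 2).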
The hyperlink clustering of step 2) is specifically as follows: calculate the distances between all adjacent hyperlinks and sort them in descending order to obtain $D = [d_1, d_2, \ldots, d_l]$; choose the index $i$ from 1 to $l$ that minimizes the selection criterion [formula missing in source], where $\beta$ is an adjustable parameter; the chosen $d_i$ is the threshold $t$ that decides whether two hyperlinks are near or far. If the distance between two hyperlinks is not greater than $t$, the two hyperlinks are near; otherwise they are far.
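Once a threshold t is available, the near/far rule can be applied in a single pass over the sorted node numbers. The sketch below is one plausible reading of the grouping step: it takes t as an input rather than deriving it, and it starts a new group whenever the gap to the previous hyperlink exceeds t. The name `group_links` and its parameters are hypothetical.

```python
def group_links(numbers, t):
    """Group hyperlink node numbers; a gap larger than t starts a new group."""
    groups = []
    for n in sorted(numbers):
        if groups and n - groups[-1][-1] <= t:
            groups[-1].append(n)   # near the previous hyperlink: same group
        else:
            groups.append([n])     # first hyperlink, or far from the previous one
    return groups
```

For example, with node numbers 5, 8, 11, 40, 43, 90 and t = 3, the first three hyperlinks form one group, the next two a second group, and the last one a group of its own.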
The formal representation of hyperlink group $C_i$ in step 3) is:
31) Mean hyperlink text length: total(words) / $N_i$, where total(words) is the total number of words in the text of all hyperlinks in $C_i$;
32) Hyperlink text length variance: the variance of the per-hyperlink word counts about this mean, where total($u_{ij}$) is the number of words in the text of hyperlink $u_{ij}$ in $C_i$.
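The three features of a hyperlink group can be computed directly from the anchor texts. The following sketch assumes that words are separated by whitespace and that the variance is the population variance (divided by the number of hyperlinks); neither detail is stated in this text, and `group_features` is a hypothetical name.

```python
def group_features(link_texts):
    """Return (count, mean word count, variance of word count) for one group."""
    if not link_texts:
        raise ValueError("a hyperlink group must contain at least one hyperlink")
    # Assumption: whitespace-separated tokens count as words. Text without
    # spaces (for example Chinese anchor text) would need a different counter.
    counts = [len(text.split()) for text in link_texts]
    n = len(counts)
    mean = sum(counts) / n
    # Assumption: population variance, i.e. divide by n rather than n - 1.
    variance = sum((c - mean) ** 2 for c in counts) / n
    return (n, mean, variance)
```

A navigation bar typically yields many hyperlinks with short, uniform text, so a large count with a small mean and variance is the pattern the classifier would be expected to pick up.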
The accessible web page navigation method based on navigation object extraction proposed by the present invention has the following advantages: navigation objects are extracted from web pages automatically, helping users navigate quickly between pages; the method applies to all types of web pages and requires no manual back-end operation; and it can be used to help blind people navigate the web without barriers.
Brief description of the drawings
Fig. 1 is a flowchart of the method of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the drawing:
An accessible web browsing method based on link clustering, comprising the following steps:
1. After web pages are crawled from the Internet, the following operations are performed on each page:
1) Parse the page into a document object model (DOM) tree, traverse the DOM tree depth-first, and number its nodes sequentially with natural numbers;
2) Represent the distance between two hyperlinks by the absolute difference of their node numbers, and cluster hyperlinks that are close to one another into one group;
3) For each hyperlink group $C_i$, calculate the number of hyperlinks $N_i$ and the mean and variance of the hyperlink text length, and formally represent $C_i$ as the vector of these three features;
4) Use the SVM classification algorithm to classify the formally represented hyperlink groups into navigation objects and non-navigation objects.
The hyperlink clustering of step 2) is specifically as follows: calculate the distances between all adjacent hyperlinks and sort them in descending order to obtain $D = [d_1, d_2, \ldots, d_l]$; choose the index $i$ from 1 to $l$ that minimizes the selection criterion [formula missing in source], where $\beta$ is an adjustable parameter; the chosen $d_i$ is the threshold $t$ that decides whether two hyperlinks are near or far. If the distance between two hyperlinks is not greater than $t$, the two hyperlinks are near; otherwise they are far.
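To make the candidate thresholds concrete, the sketch below builds the descending distance list D from hyperlink node numbers and reports how many groups each candidate d_i would produce if used as the threshold. The criterion that selects i (the expression involving the adjustable parameter) is not available in this text, so the sketch stops at enumerating candidates; the function names are hypothetical.

```python
def adjacent_distances(numbers):
    """Distances between neighbouring hyperlinks, sorted in descending order."""
    ordered = sorted(numbers)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return sorted(gaps, reverse=True)


def groups_per_candidate(numbers):
    """For each candidate threshold d_i in D, count the groups it would yield."""
    dists = adjacent_distances(numbers)
    result = []
    for d in dists:
        # Only gaps strictly greater than the threshold separate two groups.
        n_groups = 1 + sum(1 for gap in dists if gap > d)
        result.append((d, n_groups))
    return result
```

With node numbers 5, 8, 11, 40, 43, 90 the gaps are 3, 3, 29, 3, 47, so D is [47, 29, 3, 3, 3]; choosing 47 keeps everything in one group, 29 gives two groups, and 3 gives three.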
The formal representation of hyperlink group $C_i$ in step 3) is:
31) Mean hyperlink text length: total(words) / $N_i$, where total(words) is the total number of words in the text of all hyperlinks in $C_i$;
32) Hyperlink text length variance: the variance of the per-hyperlink word counts about this mean, where total($u_{ij}$) is the number of words in the text of hyperlink $u_{ij}$ in $C_i$.
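Written out, one reading consistent with the definitions of total(words) and total($u_{ij}$) is given below. The symbols $\mu_i$ and $\sigma_i^2$ are notation introduced here for illustration, and the division by $N_i$ in the variance (rather than $N_i - 1$) is an assumption; $P_i$ is the formal representation referred to in the classification step.

```latex
\mu_i = \frac{\mathrm{total}(\mathrm{words})}{N_i}
      = \frac{1}{N_i}\sum_{j=1}^{N_i} \mathrm{total}(u_{ij}),
\qquad
\sigma_i^2 = \frac{1}{N_i}\sum_{j=1}^{N_i}\bigl(\mathrm{total}(u_{ij}) - \mu_i\bigr)^2,
\qquad
P_i = \bigl(N_i,\ \mu_i,\ \sigma_i^2\bigr)
```

For a group of two hyperlinks with 1 and 3 words, this gives $N_i = 2$, $\mu_i = 2$ and $\sigma_i^2 = 1$.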
The SVM classification algorithm of step 4) is specifically as follows:
41) Collect a number of web pages and manually annotate the navigation objects in them; the annotated pages form the training set;
42) Let $P = [P_1, P_2, \ldots, P_N]$ and $Y = [Y_1, Y_2, \ldots, Y_N]$, where $P_i$ is the formal representation of hyperlink group $C_i$ in the training set; $Y_i = 1$ if $C_i$ is annotated as a navigation object, and $Y_i = 0$ otherwise;
43) Run the SVM learning algorithm on $P$ and $Y$ to generate a classification model;
44) Run the classification model on the formal representation $P_k$ of a hyperlink group $C_k$ to be classified to obtain the output $Y_k$; if $Y_k = 1$, hyperlink group $C_k$ is a navigation object, and if $Y_k = 0$, hyperlink group $C_k$ is not a navigation object.
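Steps 43) and 44) can be sketched as follows. The text names SVM but gives no kernel or parameters, so this example substitutes a plain linear soft-margin SVM trained by batch subgradient descent on the regularized hinge loss, written in pure Python to stay self-contained. The function names, the hyperparameters `lam`, `lr` and `epochs`, and the mapping of labels 0/1 to -1/+1 are all assumptions made for the illustration.

```python
def train_linear_svm(P, Y, lam=0.01, lr=0.01, epochs=200):
    """Batch subgradient descent on the L2-regularized hinge loss."""
    n, dim = len(P), len(P[0])
    signs = [1 if y == 1 else -1 for y in Y]   # SVM labels in {-1, +1}
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]        # gradient of the regularizer
        grad_b = 0.0
        for x, s in zip(P, signs):
            margin = s * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:                     # hinge loss is active for this sample
                for j in range(dim):
                    grad_w[j] -= s * x[j] / n
                grad_b -= s / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def classify(model, p):
    """Return 1 (navigation object) or 0 (non-navigation object) for vector p."""
    w, b = model
    score = sum(wj * xj for wj, xj in zip(w, p)) + b
    return 1 if score >= 0 else 0
```

Each row of P would be a feature vector of the form (hyperlink count, mean text length, text length variance), and Y the manual annotations from step 42). In practice the features would usually be scaled before training, and an established SVM library could replace the hand-written training loop.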
The content described in the embodiments of this specification merely enumerates forms in which the inventive concept may be realized. The scope of protection of the present invention should not be regarded as limited to the specific forms stated in the embodiments; it also extends to equivalent technical means that those skilled in the art can conceive of on the basis of the inventive concept.
Claims (4)
1. An accessible web page navigation method based on navigation object extraction, characterized in that, after web pages are crawled from the Internet, the following operations are performed on each page:
1) Parse the page into a document object model (DOM) tree, traverse the DOM tree depth-first, and number its nodes sequentially with natural numbers;
2) Represent the distance between two hyperlinks by the absolute difference of their node numbers, and cluster hyperlinks that are close to one another into one group;
3) For each hyperlink group $C_i$, calculate the number of hyperlinks $N_i$ and the mean and variance of the hyperlink text length, and formally represent $C_i$ as the vector of these three features;
4) Use a classification algorithm to classify the formally represented hyperlink groups into navigation objects and non-navigation objects.
2. The accessible web page navigation method based on navigation object extraction according to claim 1, characterized in that the hyperlink clustering of step 2) is specifically: calculate the distances between all adjacent hyperlinks and sort them in descending order to obtain $D = [d_1, d_2, \ldots, d_l]$; choose the index $i$ from 1 to $l$ that minimizes the selection criterion [formula missing in source], where $\beta$ is an adjustable parameter; the chosen $d_i$ is the threshold $t$ that decides whether two hyperlinks are near or far; if the distance between two hyperlinks is not greater than $t$, the two hyperlinks are near; otherwise they are far.
3. The accessible web page navigation method based on navigation object extraction according to claim 1, characterized in that the formal representation of hyperlink group $C_i$ in step 3) is:
31) Mean hyperlink text length: total(words) / $N_i$, where total(words) is the total number of words in the text of all hyperlinks in $C_i$;
32) Hyperlink text length variance: the variance of the per-hyperlink word counts about this mean, where total($u_{ij}$) is the number of words in the text of hyperlink $u_{ij}$ in $C_i$.
4. The accessible web page navigation method based on navigation object extraction according to claim 1, characterized in that the SVM classification algorithm of step 4) is specifically:
41) Collect a number of web pages and manually annotate the navigation objects in them; the annotated pages form the training set;
42) Let $P = [P_1, P_2, \ldots, P_N]$ and $Y = [Y_1, Y_2, \ldots, Y_N]$, where $P_i$ is the formal representation of hyperlink group $C_i$ in the training set; $Y_i = 1$ if $C_i$ is annotated as a navigation object, and $Y_i = 0$ otherwise;
43) Run the SVM learning algorithm on $P$ and $Y$ to generate a classification model;
44) Run the classification model on the formal representation $P_k$ of a hyperlink group $C_k$ to be classified to obtain the output $Y_k$; if $Y_k = 1$, hyperlink group $C_k$ is a navigation object, and if $Y_k = 0$, hyperlink group $C_k$ is not a navigation object.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610635259.2A CN106227882B (en) | 2016-08-02 | 2016-08-02 | A kind of accessible web page navigation method extracted based on navigation object |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610635259.2A CN106227882B (en) | 2016-08-02 | 2016-08-02 | A kind of accessible web page navigation method extracted based on navigation object |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106227882A (en) | 2016-12-14 |
CN106227882B (en) | 2019-08-23 |
Family
ID=57546904
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610635259.2A Active CN106227882B (en) | 2016-08-02 | 2016-08-02 | A kind of accessible web page navigation method extracted based on navigation object |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106227882B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108733405A (en) * | 2017-04-13 | 2018-11-02 | 富士通株式会社 | The method and apparatus that training webpage distribution indicates model |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101246494A (en) * | 2008-03-19 | 2008-08-20 | 腾讯科技(深圳)有限公司 | Internet web page conversion method, system and equipment |
US20100003645A1 (en) * | 2008-07-02 | 2010-01-07 | Moresteam.Com Llc | Education method and tool |
CN102799638A (en) * | 2012-06-25 | 2012-11-28 | 浙江大学 | In-page navigation generation method facing barrier-free access to webpage contents |
Non-Patent Citations (1)
Title |
---|
张兵等: "基于超链接和DOM结构树的网页标题实时抽取方法", 《计算机与现代化》 * |
Also Published As
Publication number | Publication date |
---|---|
CN106227882B (en) | 2019-08-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103455613B (en) | Based on the interest aware service recommendation method of MapReduce model | |
CN101388022B (en) | Web portrait search method for fusing text semantic and vision content | |
CN103226578B (en) | Towards the website identification of medical domain and the method for webpage disaggregated classification | |
CN104199822B (en) | It is a kind of to identify the method and system for searching for corresponding demand classification | |
CN107463658B (en) | Text classification method and device | |
CN103577466B (en) | Method and device for displaying webpage content in browser | |
CN104598577B (en) | A kind of extracting method of Web page text | |
CN101727498A (en) | Automatic extraction method of web page information based on WEB structure | |
CN103136358B (en) | A kind of method of Automatic Extraction forum data | |
CN103942340A (en) | Microblog user interest recognizing method based on text mining | |
CN104408093A (en) | News event element extracting method and device | |
CN102693304B (en) | Search engine feedback information processing method and search engine | |
CN103077164A (en) | Text analysis method and text analyzer | |
CN102169496A (en) | Anchor text analysis-based automatic domain term generating method | |
CN101777060A (en) | Automatic evaluation method and system of webpage visual quality | |
CN105512333A (en) | Product comment theme searching method based on emotional tendency | |
CN105573979B (en) | A kind of wrongly written character word knowledge generation method that collection is obscured based on Chinese character | |
CN101515272A (en) | Method and device for extracting webpage content | |
CN109543126A (en) | Web page text information extracting method based on block text accounting | |
CN106156372A (en) | The sorting technique of a kind of internet site and device | |
CN107402916A (en) | The segmenting method and device of Chinese text | |
CN108874996A (en) | website classification method and device | |
CN104598648B (en) | A kind of microblog users interactive mode gender identification method and device | |
CN105183715A (en) | Word distribution and document feature based automatic classification method for spam comments | |
CN102799638B (en) | In-page navigation generation method facing barrier-free access to webpage contents |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||