CN102141998B - Automatic evaluation method for webpage vision complexity - Google Patents
Automatic evaluation method for webpage vision complexity
- Publication number
- CN102141998B
- Authority
- CN
- China
- Prior art keywords
- webpage
- vision
- sample
- complexity
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Information Transfer Between Computers (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses an automatic evaluation method for webpage visual complexity. The method comprises the following steps: first, collecting webpage samples and manually labeling each sample as a visually complex webpage or a visually simple webpage to build a training set; segmenting each webpage with a webpage segmentation algorithm and extracting its layout blocks and text blocks; converting each webpage into an image; and extracting three kinds of features for each webpage (source code features, structural features and visual features) from the webpage source code together with the extracted layout blocks and text blocks; second, training a random forest classifier on the extracted features to obtain the classifier parameters, and evaluating a new webpage to determine whether it is visually complex. The method can be applied in areas such as Web search and web design, and can improve the performance of Web-based applications.
Description
Technical field
The present invention relates to the field of computer application technology, and in particular to a method for evaluating the visual complexity of webpages.
Background
Webpages on the Internet not only contain the various information people need; they are also the user interface (UI) of the Internet. How a webpage is perceived visually affects the user's experience of it. Researchers in many fields have begun to study how the visual complexity of a webpage affects users' interaction with it. Existing research indicates that a webpage with higher visual complexity imposes higher cognitive complexity on users who access it, which reduces the accessibility of the webpage. The content of a visually complex webpage is difficult for visually impaired users to obtain. Consequently, a considerable body of work on evaluating webpage visual complexity already exists in the fields of human-computer interaction (HCI) and web design. However, because researchers in these fields usually have limited expertise in web mining, visual information processing and model construction, the evaluation models they have designed are not suited to the automatic evaluation of webpages at large scale.
Summary of the invention
(1) Technical problem to be solved
In view of the above, the main purpose of the present invention is to provide an automatic evaluation method for webpage visual complexity.
(2) Technical solution
To achieve the above purpose, the invention provides an automatic evaluation method for webpage visual complexity, the method comprising:
Step 1: collect as many webpage samples as possible to build a training set, and have people judge whether each webpage is visually complex; samples whose aggregate manual label is "visually complex" form the positive sample set, samples whose aggregate manual label is "visually simple" form the negative sample set, and the two sets together constitute the training set;
Step 2: obtain the source code of each webpage, segment each webpage with a webpage segmentation algorithm, and extract its layout blocks and text blocks;
Step 3: convert each webpage into an image and extract three kinds of features for each webpage: source code features, structural features and visual features;
Step 4: train a random forest classifier on the extracted features of each webpage to obtain the classifier parameters, then evaluate a new webpage sample and determine whether its visual complexity is above the complexity threshold.
Wherein, to build the training set, several users are asked to judge whether each webpage sample is visually complex, giving several visual complexity ratings per sample, and the ratings for each sample are averaged. A webpage sample whose mean rating is higher than the visual complexity threshold has the aggregate manual label "visually complex" and is classed as a positive sample; a webpage sample whose mean rating is lower than the visual complexity threshold has the aggregate manual label "visually simple" and is classed as a negative sample. All positive samples form the positive sample set, all negative samples form the negative sample set, and the two sets together constitute the training set;
Wherein, the source code features comprise: the number of text characters in the webpage, the number of characters in the display text of its hyperlinks, the number of fonts used, the number of background colors, and the number of images.
Wherein, the structural features comprise: the number of layout blocks, the number of text blocks, the ratio of the total text block area to the area of the whole webpage, the ratio of the number of text characters in the webpage to the text block area, the aspect ratio of the webpage, and the sum of the webpage's length and width.
Wherein, the visual features comprise: the hue of the webpage, denoted Hue(Page); the brightness of the webpage, denoted Brightness(Page); the colorfulness of the webpage, denoted Colorfulness(Page); and the file size of the webpage after conversion to an image. To compute the visual features, the webpage is first converted into an image, which is represented in both the HSV and RGB color spaces; the following formula is then applied:
$$\mathrm{Colorfulness}(Page) = \alpha_{rgyb} + 0.3\,\beta_{rgyb}$$
Wherein, $i$ and $j$ are the horizontal and vertical positions of a pixel in the image, and $N$ and $M$ are the height and width of the image corresponding to the webpage; $H(i,j)$, $S(i,j)$ and $V(i,j)$ are the H, S and V values in the HSV color space of the pixel at position $(i,j)$. In the colorfulness formula, $\alpha_{rgyb}$ and $\beta_{rgyb}$ denote the variance color factor and the mean color factor respectively, and are computed from:
$$rg(i,j) = R(i,j) - G(i,j)$$
$$yb(i,j) = 0.5\,\bigl(R(i,j) + G(i,j)\bigr) - B(i,j)$$
where $R(i,j)$, $G(i,j)$ and $B(i,j)$ are the R, G and B values in the RGB color space of the pixel at position $(i,j)$. Over all image pixels, the difference $rg$ between the R and G values has variance $\sigma_{rg}$ and mean $\mu_{rg}$; the difference $yb$ between the average of the R and G components and the B component has variance $\sigma_{yb}$ and mean $\mu_{yb}$.
Wherein, a random forest classification algorithm is used to classify the webpage and determine whether it is above the visual complexity threshold.
(3) Beneficial effects
As can be seen from the above technical solution, the present invention has the following advantages:
1. The evaluation method provided by the invention extracts the features of a webpage from three aspects: source code features, structural features and visual features. Together these three aspects describe fairly completely the information that may affect the visual complexity of a webpage: its layout, the quantity and distribution of its text and images, and its visual information. Each aspect can be modified and extended independently, so that if faster or more effective features emerge in the future they can easily be added to the method, further improving its performance.
2. Feature extraction and classification in this method are fully automatic and require no manual intervention, so the method can easily be embedded in a wide range of existing Web applications and has broad application prospects.
Description of drawings
Fig. 1a shows the layout blocks of a webpage according to the invention;
Fig. 1b shows the text blocks of a webpage according to the invention;
Fig. 2 is a flowchart of the webpage visual complexity evaluation method provided by the invention.
Embodiment
To make the purpose, technical solution and advantages of the invention clearer, the invention is described in more detail below with reference to a specific embodiment and the accompanying drawings.
The invention was implemented on a Pentium 4 computer with a 3.0 GHz central processing unit and 2 GB of memory, with the webpage visual complexity evaluation algorithm written in C++. Other execution environments may also be used and are not described further here.
Fig. 2 is a flowchart of the webpage visual complexity evaluation method provided by the invention. Its steps are as follows:
Step 301: First collect as many webpage samples as possible and manually label each sample as a webpage of high visual complexity or of low visual complexity. Samples whose aggregate manual label is high visual complexity form the positive sample set, and samples whose aggregate manual label is visually simple form the negative sample set; the positive and negative sample sets together constitute the training set. This step should collect as many webpage samples as possible so that the training set is broadly representative. When the samples are manually labeled for visual complexity, several users should, as far as possible, judge each sample. Each judgment is a visual complexity score for the sample, drawn from a scoring interval set in advance; the higher the score a user gives, the higher the visual complexity. Once several visual complexity scores have been obtained for a sample, they are averaged. A webpage sample whose mean score is higher than the visual complexity threshold (the threshold is set to the midpoint of the scoring interval) has the aggregate manual label "visually complex" and is labeled a positive sample; a webpage sample whose mean score is lower than the threshold has the aggregate manual label "visually simple" and is labeled a negative sample.
Suppose the scoring interval for webpage visual complexity is [0, 10], where a higher score means the webpage is visually more complex, and the visual complexity threshold is the midpoint of the interval, namely 5. Suppose four users score a sample 1, 2, 3 and 6: the mean is 3, which is less than 5, so the aggregate manual label of the sample is "visually simple" and it is labeled a negative sample. Suppose instead the scores are 5, 10, 7 and 8: the mean is 7.5, which is greater than 5, so the aggregate manual label of the sample is "visually complex" and it is labeled a positive sample.
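The labeling rule can be sketched in a few lines of Python. This is an illustration only, not the patented C++ implementation: the function name `label_sample` and its parameters are hypothetical, and treating a mean exactly equal to the threshold as negative is an assumption, since the text does not cover that case.

```python
from statistics import fmean

def label_sample(scores, low=0.0, high=10.0):
    """Return (label, mean_score) for one webpage sample.

    scores: the visual complexity scores given by several users.
    The threshold is the midpoint of the scoring interval [low, high].
    Assumption: a mean exactly equal to the threshold counts as negative.
    """
    threshold = (low + high) / 2
    mean_score = fmean(scores)
    label = "positive" if mean_score > threshold else "negative"
    return label, mean_score

# The two worked examples from the text
print(label_sample([1, 2, 3, 6]))    # mean 3.0, below 5
print(label_sample([5, 10, 7, 8]))   # mean 7.5, above 5
```

Both calls reproduce the worked examples: the first sample is negative (visually simple) and the second is positive (visually complex).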
Step 302: Segment every webpage in the training set and extract the layout blocks shown in Fig. 1a (the rectangles outlined with thick lines) and the text blocks shown in Fig. 1b (the rectangles outlined with thick lines) as input to the subsequent feature extraction;
Many webpage segmentation methods can be used, for example the vision-based page segmentation algorithm (VIPS) or segmentation algorithms based on the document tree (DOM). The segmentation algorithm produces a visual block tree for the webpage. The rectangle corresponding to each leaf node of the visual block tree is taken as a layout block; the rectangular block corresponding to each leaf node whose text character count is greater than or equal to a given threshold (the threshold ranges from 30 to 100 and is typically 50) is taken as a text block;
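The selection of layout blocks and text blocks from a visual block tree can be sketched as follows. The segmentation itself (VIPS or a DOM-based method) is not shown; the `VisualBlock` structure, the function names and the choice to count characters after stripping surrounding whitespace are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class VisualBlock:
    """Node of a visual block tree (hypothetical structure, pixel units)."""
    x: int
    y: int
    width: int
    height: int
    text: str = ""
    children: list = field(default_factory=list)

def leaf_blocks(node):
    """Yield the leaf nodes of the tree in depth-first order."""
    if not node.children:
        yield node
    else:
        for child in node.children:
            yield from leaf_blocks(child)

def split_blocks(root, min_chars=50):
    """Return (layout_blocks, text_blocks) for one page.

    Every leaf rectangle is a layout block; a leaf whose text has at least
    min_chars characters is also a text block (the text gives 30-100 as the
    threshold range and 50 as the usual value).
    """
    layout = list(leaf_blocks(root))
    text = [b for b in layout if len(b.text.strip()) >= min_chars]
    return layout, text
```

A leaf holding exactly 50 characters counts as a text block, because the text specifies "greater than or equal to" the threshold.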
Step 303: Obtain the source code of the webpage, convert the webpage into an image and, using the extracted layout blocks and text blocks, extract the source code features, structural features and visual features of the webpage;
The source code features of the webpage comprise: the number of text characters in the webpage, the number of characters in the display text of its hyperlinks, the number of fonts used, the number of background colors, and the number of images;
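One possible way to count these five quantities from raw HTML is sketched below with Python's standard `html.parser` module. The text does not say how the counts are obtained, so everything here is an assumption: the class and function names are hypothetical, whitespace is excluded from character counts, fonts are read only from `face` attributes and inline `font-family` declarations, and background colors only from `bgcolor` attributes and inline `background-color` declarations (external stylesheets are ignored).

```python
import re
from html.parser import HTMLParser

class SourceFeatureParser(HTMLParser):
    """Collects rough counts for the five source code features."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0      # visible, non-whitespace text characters
        self.link_chars = 0      # characters inside <a> display text
        self.images = 0          # number of <img> tags
        self.fonts = set()       # distinct font names seen
        self.bg_colors = set()   # distinct background colors seen
        self._in_link = 0        # nesting depth inside <a>
        self._hidden = 0         # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._hidden += 1
        elif tag == "a":
            self._in_link += 1
        elif tag == "img":
            self.images += 1
        if attrs.get("face"):
            self.fonts.add(attrs["face"].strip().lower())
        if attrs.get("bgcolor"):
            self.bg_colors.add(attrs["bgcolor"].strip().lower())
        # Inline CSS such as "font-family: Arial; background-color: red"
        style = attrs.get("style") or ""
        for name, value in re.findall(r"([\w-]+)\s*:\s*([^;]+)", style):
            if name.lower() == "font-family":
                self.fonts.add(value.strip().lower())
            elif name.lower() == "background-color":
                self.bg_colors.add(value.strip().lower())

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._hidden = max(0, self._hidden - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)

    def handle_data(self, data):
        if self._hidden:
            return  # script and style bodies are not visible text
        count = len("".join(data.split()))
        self.text_chars += count
        if self._in_link:
            self.link_chars += count

def source_features(html):
    """Return the five source code features of one page as a dict."""
    parser = SourceFeatureParser()
    parser.feed(html)
    parser.close()
    return {
        "text_chars": parser.text_chars,
        "link_chars": parser.link_chars,
        "font_count": len(parser.fonts),
        "bg_color_count": len(parser.bg_colors),
        "image_count": parser.images,
    }
```

A production extractor would also need to resolve linked stylesheets and computed styles, which this sketch deliberately leaves out.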
The structural features of the webpage comprise: the number of layout blocks, the number of text blocks, the ratio of the text block area to the area of the whole webpage, the ratio of the number of text characters in the webpage to the text block area, the aspect ratio of the webpage, and the sum of the webpage's length and width;
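The six structural features reduce to simple arithmetic once the blocks are known, as the sketch below shows. The function name, the argument layout (blocks given as `(width, height)` tuples in pixels) and the reading of "aspect ratio" as height divided by width are assumptions, since the text does not fix them.

```python
def structure_features(page_w, page_h, layout_blocks, text_blocks, page_chars):
    """Compute the six structural features of one page.

    layout_blocks, text_blocks: lists of (width, height) tuples in pixels.
    page_chars: number of text characters in the whole page.
    Assumption: aspect ratio is taken as page height divided by page width.
    """
    text_area = sum(w * h for w, h in text_blocks)
    return {
        "layout_block_count": len(layout_blocks),
        "text_block_count": len(text_blocks),
        "text_area_ratio": text_area / (page_w * page_h),
        "chars_per_text_area": page_chars / text_area if text_area else 0.0,
        "aspect_ratio": page_h / page_w,
        "width_plus_height": page_w + page_h,
    }

features = structure_features(
    page_w=1000, page_h=2000,
    layout_blocks=[(1000, 100), (500, 400), (500, 400)],
    text_blocks=[(500, 400), (500, 400)],
    page_chars=800,
)
print(features["text_area_ratio"])  # text blocks cover one fifth of the page
```

Returning 0.0 when a page has no text blocks is a convenience choice to avoid division by zero.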
The visual features of the webpage comprise: the hue of the webpage, denoted Hue(Page); the brightness of the webpage, denoted Brightness(Page); the colorfulness of the webpage, denoted Colorfulness(Page); and the file size of the webpage after conversion to an image. To compute the visual features, the webpage is first converted into an image, which is represented in both the HSV and RGB color spaces; the following formula is then applied:
$$\mathrm{Colorfulness}(Page) = \alpha_{rgyb} + 0.3\,\beta_{rgyb}$$
Wherein, $i$ and $j$ are the horizontal and vertical positions of a pixel in the image, and $N$ and $M$ are the height and width of the image corresponding to the webpage; $H(i,j)$, $S(i,j)$ and $V(i,j)$ are the H, S and V values in the HSV color space of the pixel at position $(i,j)$. In the colorfulness formula, $\alpha_{rgyb}$ and $\beta_{rgyb}$ denote the variance color factor and the mean color factor respectively, and are computed from:
$$rg(i,j) = R(i,j) - G(i,j)$$
$$yb(i,j) = 0.5\,\bigl(R(i,j) + G(i,j)\bigr) - B(i,j)$$
where $R(i,j)$, $G(i,j)$ and $B(i,j)$ are the R, G and B values in the RGB color space of the pixel at position $(i,j)$. Over all image pixels, the difference $rg$ between the R and G values has variance $\sigma_{rg}$ and mean $\mu_{rg}$; the difference $yb$ between the average of the R and G components and the B component has variance $\sigma_{yb}$ and mean $\mu_{yb}$.
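The colorfulness computation can be illustrated with the short Python sketch below. The text gives the definitions of $rg$ and $yb$ and the final sum, but not how the two factors are built from the statistics. The sketch assumes the widely used Hasler-Süsstrunk form, $\alpha_{rgyb} = \sqrt{\sigma_{rg}^2 + \sigma_{yb}^2}$ and $\beta_{rgyb} = \sqrt{\mu_{rg}^2 + \mu_{yb}^2}$, with $\sigma$ read as a standard deviation; the function name and the pixel format are likewise hypothetical.

```python
import math
from statistics import fmean, pvariance

def colorfulness(pixels):
    """Colorfulness of an image given as a sequence of (R, G, B) tuples.

    Assumed form (not confirmed by the text):
        alpha = sqrt(var(rg) + var(yb))      # spread of the opponent channels
        beta  = sqrt(mean(rg)^2 + mean(yb)^2)  # offset of the opponent channels
        colorfulness = alpha + 0.3 * beta
    """
    pixels = list(pixels)
    rg = [r - g for r, g, b in pixels]               # red-green opponent channel
    yb = [0.5 * (r + g) - b for r, g, b in pixels]   # yellow-blue opponent channel
    alpha = math.sqrt(pvariance(rg) + pvariance(yb))
    beta = math.sqrt(fmean(rg) ** 2 + fmean(yb) ** 2)
    return alpha + 0.3 * beta

gray_page = [(128, 128, 128)] * 4
print(colorfulness(gray_page))  # a uniform gray image has no colorfulness
```

For a uniform gray image both opponent channels are zero everywhere, so the result is zero; images with strongly varying or strongly offset colors score higher.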
Step 304: Divide the resulting set of feature vectors into a training set and a test set. Train the random forest classifier on the training set to obtain the classifier parameters, classify the test set with the trained random forest model and output the classification results, and use the model to predict new webpage samples.
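The training and prediction flow of step 304 can be illustrated with a toy forest in pure Python. This is a deliberately simplified sketch, not the classifier described in the patent: each tree is a depth-one stump trained on a bootstrap sample with a random subset of features, and all names and parameters (`fit_forest`, `n_trees`, `seed`) are hypothetical. A real system would normally rely on a tested library implementation with deeper trees.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y, feature_ids):
    """Best single-threshold split over the given features (depth-one tree)."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in feature_ids:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[f] <= thr]
            right = [lab for row, lab in zip(X, y) if row[f] > thr]
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(lab != l_lab for lab in left)
                      + sum(lab != r_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, thr, l_lab, r_lab)
    if best is None:  # no split possible: fall back to a constant prediction
        label = majority(y)
        return (0, 0.0, label, label)
    return best[1:]

def fit_forest(X, y, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a feature subset."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    k = max(1, int(d ** 0.5))  # number of features tried per tree
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feats = rng.sample(range(d), k)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Majority vote of all trees: 1 = visually complex, 0 = visually simple."""
    votes = [(r_lab if row[f] > thr else l_lab) for f, thr, l_lab, r_lab in forest]
    return majority(votes)
```

In use, `fit_forest` would be called on the training part of the feature vectors and `predict` on each test or new webpage; the fixed `seed` only makes the toy example repeatable.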
The above is only a specific embodiment of the invention, and the protection scope of the invention is not limited to it. Any variation or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of the invention. The protection scope of the invention shall therefore be defined by the claims.
Claims (2)
1. A method for evaluating the visual complexity of a webpage, characterized in that the method comprises:
Step 1: collect as many webpage samples as possible to build a training set, and have people judge whether each webpage is visually complex; samples whose aggregate manual label is "visually complex" form the positive sample set, samples whose aggregate manual label is "visually simple" form the negative sample set, and the two sets together constitute the training set;
Step 2: obtain the source code of each webpage in the training set, segment each webpage with a webpage segmentation algorithm, and extract its layout blocks and text blocks;
Step 3: convert each webpage into an image and extract three kinds of features for each webpage: source code features, structural features and visual features;
The source code features comprise: the number of text characters in the webpage, the number of characters in the display text of its hyperlinks, the number of fonts used, the number of background colors, and the number of images;
The structural features comprise: the number of layout blocks, the number of text blocks, the ratio of the text block area to the area of the whole webpage, the ratio of the number of text characters in the webpage to the text block area, the aspect ratio of the webpage, and the sum of the webpage's length and width;
The visual features comprise: the hue of the webpage, denoted Hue(Page); the brightness of the webpage, denoted Brightness(Page); the colorfulness of the webpage, denoted Colorfulness(Page); and the file size of the webpage after conversion to an image. To compute the visual features, the webpage is first converted into an image, which is represented in both the HSV and RGB color spaces; the following formula is then applied:
$$\mathrm{Colorfulness}(Page) = \alpha_{rgyb} + 0.3\,\beta_{rgyb}$$
Wherein, $i$ and $j$ are the horizontal and vertical positions of a pixel in the image, and $N$ and $M$ are the height and width of the image corresponding to the webpage; $H(i,j)$, $S(i,j)$ and $V(i,j)$ are the H, S and V values in the HSV color space of the pixel at position $(i,j)$. In the colorfulness formula, $\alpha_{rgyb}$ and $\beta_{rgyb}$ denote the variance color factor and the mean color factor respectively, and are computed from:
$$rg(i,j) = R(i,j) - G(i,j)$$
$$yb(i,j) = 0.5\,\bigl(R(i,j) + G(i,j)\bigr) - B(i,j)$$
where $R(i,j)$, $G(i,j)$ and $B(i,j)$ are the R, G and B values in the RGB color space of the pixel at position $(i,j)$. Over all image pixels, the difference $rg$ between the R and G values has variance $\sigma_{rg}$ and mean $\mu_{rg}$; the difference $yb$ between the average of the R and G components and the B component has variance $\sigma_{yb}$ and mean $\mu_{yb}$;
Step 4: train a random forest classifier on the extracted features of each webpage to obtain the classifier parameters, then evaluate a new webpage sample and determine whether its visual complexity is above the complexity threshold.
2. The method for evaluating the visual complexity of a webpage according to claim 1, characterized in that, to build the training set, several users are asked to judge whether each webpage sample is visually complex, giving several visual complexity ratings per sample, and the ratings for each sample are averaged; a webpage sample whose mean rating is higher than the visual complexity threshold has the aggregate manual label "visually complex" and is classed as a positive sample; a webpage sample whose mean rating is lower than the visual complexity threshold has the aggregate manual label "visually simple" and is classed as a negative sample; all positive samples form the positive sample set, all negative samples form the negative sample set, and the two sets together constitute the training set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201010106759 CN102141998B (en) | 2010-02-03 | 2010-02-03 | Automatic evaluation method for webpage vision complexity |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201010106759 CN102141998B (en) | 2010-02-03 | 2010-02-03 | Automatic evaluation method for webpage vision complexity |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102141998A (en) | 2011-08-03 |
CN102141998B (en) | 2013-02-27 |
Family
ID=44409522
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201010106759 Expired - Fee Related CN102141998B (en) | 2010-02-03 | 2010-02-03 | Automatic evaluation method for webpage vision complexity |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102141998B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106599931B (en) * | 2016-12-23 | 2019-07-02 | 南京师范大学 | A kind of broken ridge line correlating method based on random forest |
CN107358209B (en) * | 2017-07-17 | 2020-02-28 | 成都通甲优博科技有限责任公司 | Training method and device of face detection model and face detection method and device |
CN108171760A (en) * | 2018-01-29 | 2018-06-15 | 河南大学 | A kind of image forms complexity calculating method |
CN108921184A (en) * | 2018-04-18 | 2018-11-30 | 中国科学院信息工程研究所 | A kind of general type of webpage determination method |
CN109740435A (en) * | 2018-11-30 | 2019-05-10 | 四川译讯信息科技有限公司 | A kind of picture class file complexity determination method and platform |
CN109740434A (en) * | 2018-11-30 | 2019-05-10 | 四川译讯信息科技有限公司 | A kind of document class file complexity determination method and platform |
Also Published As
Publication number | Publication date |
---|---|
CN102141998A (en) | 2011-08-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101777060B (en) | Webpage classification method and system based on webpage visual characteristics | |
CN102141998B (en) | Automatic evaluation method for webpage vision complexity | |
CN105516802B (en) | The news video abstract extraction method of multiple features fusion | |
CN102156865A (en) | Handwritten text line character segmentation method and identification method | |
CN102682120B (en) | Method and device for acquiring essential article commented on network | |
CN112528997B (en) | Tibetan-Chinese bilingual scene text detection method based on text center region amplification | |
CN101533517A (en) | Structure feature based on Chinese painting and calligraphy seal image automatic extracting method | |
CN107301200A (en) | A kind of article appraisal procedure and system analyzed based on Sentiment orientation | |
CN102880865A (en) | Dynamic gesture recognition method based on complexion and morphological characteristics | |
CN108334493A (en) | A kind of topic knowledge point extraction method based on neural network | |
CN103455823B (en) | The English character recognition method that a kind of fuzzy based on classification and image is split | |
CN103440494A (en) | Horrible image identification method and system based on visual significance analyses | |
CN101980210A (en) | Marked word classifying and grading method and system | |
CN110765739A (en) | Method for extracting table data and chapter structure from PDF document | |
CN101251896B (en) | Object detecting system and method based on multiple classifiers | |
CN105808126A (en) | Electronic teaching whiteboard operation method and device | |
EP2458334A3 (en) | Road estimation device and method for estimating road | |
CN105930798A (en) | Tongue image quick detection and segmentation method based on learning and oriented to handset application | |
CN106372639B (en) | Block letter Uighur document cutting method based on morphology and integral projection | |
CN103985130A (en) | Image significance analysis method for complex texture images | |
EP2733643A3 (en) | System and method facilitating designing of classifier while recognizing characters in a video | |
CN104281850A (en) | Character area identification method and device | |
CN105844679A (en) | Method and device for complex braille dot pattern drafting and embedded character input | |
CN109447015A (en) | A kind of method and device handling form Image center selection word | |
CA2971996C (en) | Chinese character information recording method and chinese character stroke order determining diagram device for teaching |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20130227; termination date: 20220203 |