CN102141998B - Automatic evaluation method for webpage vision complexity - Google Patents


Info

Publication number
CN102141998B
Authority
CN
China
Prior art keywords
webpage
vision
sample
complexity
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 201010106759
Other languages
Chinese (zh)
Other versions
CN102141998A (en)
Inventor
吴偶
胡卫明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN 201010106759
Publication of CN102141998A
Application granted
Publication of CN102141998B

Landscapes

  • Information Transfer Between Computers (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an automatic method for evaluating the visual complexity of webpages. First, webpage samples are collected and each sample is manually labeled as visually complex or visually simple to build a training set; each webpage is segmented with a webpage segmentation algorithm to extract its layout blocks and text blocks; each webpage is converted into an image; and three kinds of features (source code features, structural features and visual features) are extracted from each webpage by combining its source code with the extracted layout blocks and text blocks. Second, a random forest classifier is trained on the extracted features to obtain the classifier parameters, and the trained classifier evaluates a new webpage and judges whether it is visually complex. The method can be applied in areas such as Web search and web design, and can improve the performance of Web-based applications.

Description

Automatic evaluation method for webpage visual complexity
Technical field
The present invention relates to the field of computer application technology, and in particular to a method for evaluating the visual complexity of webpages.
Background art
Webpages not only contain the information people need; they are also the user interface (UI) of the Internet. The visual perception of a webpage affects the user's experience of it. Researchers in many fields have begun to study how the visual complexity of a webpage affects the user's interaction with it. Existing research indicates that a webpage with higher visual complexity raises the cognitive complexity the user faces when visiting it, which reduces the accessibility of the webpage. The content of a visually complex webpage is difficult for visually impaired users to obtain. Considerable work on evaluating webpage visual complexity therefore exists in the fields of human-computer interaction (HCI) and web design. However, researchers in these fields usually lack expertise in web mining, visual information processing and model construction, so the evaluation models they have designed are not suited to the automatic evaluation of webpages at large scale.
Summary of the invention
(1) Technical problem to be solved
In view of this, the main purpose of the present invention is to provide an automatic method for evaluating the visual complexity of webpages.
(2) Technical solution
To achieve the above purpose, the invention provides an automatic method for evaluating webpage visual complexity, the method comprising:
Step 1: Collect as many webpage samples as possible to build a training set. Each webpage is manually judged as visually complex or not. Samples whose combined manual label is "visually complex" form the positive sample set, samples whose combined manual label is "visually simple" form the negative sample set, and the two sets together constitute the training set.
Step 2: Obtain the source code of each webpage, and use a webpage segmentation algorithm to segment each webpage and extract its layout blocks and text blocks.
Step 3: Convert each webpage into an image, and extract three kinds of features from each webpage: source code features, structural features and visual features.
Step 4: Train a random forest classifier on the extracted features of each webpage to obtain the classifier parameters, then evaluate a new webpage sample and judge whether it is a webpage above the complexity threshold.
The training set is built by asking multiple users to judge whether each webpage sample is visually complex, which yields several visual complexity ratings per sample, and then averaging those ratings. A sample whose mean rating is above the visual complexity threshold has the combined manual label "visually complex" and is classed as a positive sample; a sample whose mean rating is below the threshold has the combined manual label "visually simple" and is classed as a negative sample. All positive samples form the positive sample set, all negative samples form the negative sample set, and the two sets together constitute the training set.
The source code features comprise: the number of text characters the webpage contains, the number of characters in the display text of its hyperlinks, the number of fonts it uses, the number of background colors, and the number of images.
The structural features comprise: the number of layout blocks, the number of text blocks, the ratio of the total area of the text blocks to the area of the whole webpage, the ratio of the number of text characters to the text block area, the length-to-width ratio of the webpage, and the sum of the webpage's length and width.
The visual features comprise: the hue of the webpage, denoted Hue(Page); its brightness, denoted Brightness(Page); its colorfulness, denoted Colorfulness(Page); and the size of the file obtained after the webpage is converted into an image. To compute the visual features, the webpage is first converted into an image, which is represented in both the HSV and the RGB color space. The following formulas are then applied:
$$\mathrm{Hue}(\mathrm{Page}) = \frac{1}{N \cdot M} \sum_{i=1}^{N} \sum_{j=1}^{M} H(i,j)$$
$$\mathrm{Brightness}(\mathrm{Page}) = \frac{1}{N \cdot M} \sum_{i=1}^{N} \sum_{j=1}^{M} V(i,j)$$
$$\mathrm{Colorfulness}(\mathrm{Page}) = \alpha_{rgyb} + 0.3\,\beta_{rgyb}$$
Here i and j denote the position of a pixel in the horizontal and vertical directions, and N and M are the height and width of the image corresponding to the webpage. H(i, j), S(i, j) and V(i, j) are the H, S and V values in the HSV color space of the pixel at position (i, j). In the colorfulness formula, $\alpha_{rgyb}$ and $\beta_{rgyb}$ are the variance color factor and the mean color factor, respectively, computed as:
$$rg(i,j) = R(i,j) - G(i,j)$$
$$yb(i,j) = 0.5\,(R(i,j) + G(i,j)) - B(i,j)$$
$$\alpha_{rgyb} = [\sigma_{rg}]^2 + [\sigma_{yb}]^2$$
$$\beta_{rgyb} = [\mu_{rg}]^2 + [\mu_{yb}]^2$$
R(i, j), G(i, j) and B(i, j) are the R, G and B values in the RGB color space of the pixel at position (i, j). Over all pixels of the image, $\sigma_{rg}$ and $\mu_{rg}$ are the variance and the mean of rg, the difference between the R and G values; $\sigma_{yb}$ and $\mu_{yb}$ are the variance and the mean of yb, the difference between the average of the R and G components and the B component.
A random forest classification algorithm is used to classify a webpage and judge whether it is above the visual complexity threshold.
(3) Beneficial effects
As the above technical solution shows, the present invention has the following advantages:
1. The method extracts the features of a webpage from three aspects: source code features, structural features and visual features. Together these describe fairly completely the information that may affect a webpage's visual complexity: its layout, the quantity and distribution of its text and images, and its visual information. Each group of features can be changed and extended independently, so if faster or more effective features emerge in the future they can easily be added to the method to further improve its performance.
2. Feature extraction and classification in this method are fully automatic and need no manual intervention, so the method can easily be embedded in all kinds of existing Web applications and has broad application prospects.
Description of drawings
Fig. 1a shows the layout blocks of a webpage according to the present invention;
Fig. 1b shows the text blocks of a webpage according to the present invention;
Fig. 2 is a flowchart of the webpage visual complexity evaluation method provided by the present invention.
Embodiment
To make the purpose, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to a specific embodiment and the accompanying drawings.
The invention was implemented on a Pentium 4 computer with a 3.0 GHz central processing unit and 2 GB of memory, with the webpage visual complexity evaluation method programmed in C++. Other execution environments can also be used and are not described further here.
Fig. 2 is a flowchart of the webpage visual complexity evaluation method provided by the present invention. Its steps are as follows:
Step 301: Collect as many webpage samples as possible and manually label each sample as a webpage of high or low visual complexity. Samples whose combined manual label is "high visual complexity" form the positive sample set, and samples whose combined manual label is "visually simple" form the negative sample set; together the two sets constitute the training set. As many webpage samples as possible should be collected in this step so that the training set is broadly representative. For the manual labeling, multiple users should, where possible, rate the visual complexity of each sample. Each rating is a visual complexity score drawn from a predefined interval, and a higher score means higher visual complexity. Once several ratings have been obtained for a sample, they are averaged. A sample whose mean score is above the visual complexity threshold (set to the midpoint of the scoring interval) has the combined manual label "visually complex" and is marked as a positive sample; a sample whose mean score is below the threshold has the combined manual label "visually simple" and is marked as a negative sample.
Suppose the scoring interval for visual complexity is [0, 10], with a higher score meaning a visually more complex webpage, and the threshold is the midpoint of the interval, that is, 5. If four users score a sample 1, 2, 3 and 6, the mean is 3, which is less than 5, so the sample's combined manual label is "visually simple" and it is marked as a negative sample. If the scores are 5, 10, 7 and 8, the mean is 7.5, which is greater than 5, so the combined manual label is "visually complex" and the sample is marked as a positive sample.
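The labeling rule of Step 301 can be sketched in a few lines of Python. This is only an illustration: the function names `label_sample` and `build_training_set` are hypothetical, and the handling of a mean exactly equal to the threshold is an assumption, since the text only covers means above or below it.

```python
from statistics import mean


def label_sample(scores, low=0.0, high=10.0):
    """Return 1 (visually complex) or 0 (visually simple) for one sample."""
    threshold = (low + high) / 2.0  # midpoint of the scoring interval
    # Assumption: a mean exactly equal to the threshold counts as simple.
    return 1 if mean(scores) > threshold else 0


def build_training_set(ratings):
    """Split {sample_id: [scores]} into positive and negative sample lists."""
    positive = [sid for sid, s in ratings.items() if label_sample(s) == 1]
    negative = [sid for sid, s in ratings.items() if label_sample(s) == 0]
    return positive, negative


if __name__ == "__main__":
    # The two worked examples from the text.
    print(label_sample([1, 2, 3, 6]))   # mean 3   -> negative sample
    print(label_sample([5, 10, 7, 8]))  # mean 7.5 -> positive sample
```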
Step 302: Segment each webpage in the training set and extract its layout blocks, shown in Fig. 1a (the rectangles outlined with thick lines), and its text blocks, shown in Fig. 1b (the rectangles outlined with thick lines), as input to the subsequent feature extraction.
Many webpage segmentation methods can be used, for example the vision-based page segmentation algorithm (VIPS) or segmentation algorithms based on the document tree (DOM). The segmentation algorithm produces a visual block tree for the webpage. The rectangle corresponding to each leaf node of the visual block tree is taken as a layout block, and the rectangle corresponding to each leaf node that contains at least a given threshold number of text characters (the threshold ranges from 30 to 100 and is typically 50) is taken as a text block.
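A minimal sketch of the block selection rule follows, assuming the segmentation algorithm has already produced a visual block tree. The `VisualBlock` class and its fields are hypothetical stand-ins for whatever the segmenter outputs; they are not part of VIPS or any DOM library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VisualBlock:
    """Hypothetical node of a visual block tree (rectangle plus text count)."""
    width: int
    height: int
    text_chars: int = 0
    children: List["VisualBlock"] = field(default_factory=list)


def leaf_blocks(node):
    """Collect all leaf nodes of the visual block tree."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaf_blocks(child))
    return found


def split_blocks(root, min_text_chars=50):
    """Leaves are layout blocks; leaves with enough text are text blocks."""
    layout = leaf_blocks(root)
    text = [b for b in layout if b.text_chars >= min_text_chars]
    return layout, text
```

The default of 50 matches the typical threshold named in the text; any value in the stated 30 to 100 range can be passed instead.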
Step 303: Obtain the source code of the webpage, convert the webpage into an image and, using the extracted layout blocks and text blocks, extract the source code features, structural features and visual features of the webpage.
The source code features comprise: the number of text characters the webpage contains, the number of characters in the display text of its hyperlinks, the number of fonts it uses, the number of background colors, and the number of images.
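As a rough illustration, the five source code features could be counted with Python's standard `html.parser` module. The text does not say exactly how fonts and background colors are counted, so the rules below (distinct `face` values on `font` tags, distinct `bgcolor` attributes, and `font-family` / `background-color` in inline `style` attributes, with whitespace excluded from character counts) are assumptions, and the names `SourceFeatureParser` and `source_features` are hypothetical.

```python
import re
from html.parser import HTMLParser


class SourceFeatureParser(HTMLParser):
    """Counts the five source code features while parsing HTML."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0     # visible text characters
        self.link_chars = 0     # characters inside <a>...</a>
        self.fonts = set()      # distinct font names seen
        self.bg_colors = set()  # distinct background colors seen
        self.images = 0         # number of <img> tags
        self._in_link = 0
        self._skip = 0          # inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag == "img":
            self.images += 1
        if tag == "font" and attr.get("face"):
            self.fonts.add(attr["face"].strip().lower())
        if attr.get("bgcolor"):
            self.bg_colors.add(attr["bgcolor"].strip().lower())
        style = attr.get("style") or ""
        for value in re.findall(r"font-family\s*:\s*([^;]+)", style, re.I):
            self.fonts.add(value.strip().lower())
        for value in re.findall(r"background-color\s*:\s*([^;]+)", style, re.I):
            self.bg_colors.add(value.strip().lower())

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        if self._skip:
            return
        n = len("".join(data.split()))  # ignore whitespace
        self.text_chars += n
        if self._in_link:
            self.link_chars += n


def source_features(html):
    """Return the five source code features of an HTML string."""
    parser = SourceFeatureParser()
    parser.feed(html)
    parser.close()
    return {
        "text_chars": parser.text_chars,
        "link_chars": parser.link_chars,
        "fonts": len(parser.fonts),
        "bg_colors": len(parser.bg_colors),
        "images": parser.images,
    }
```

Fonts and colors set in external stylesheets are not seen by this sketch; a fuller implementation would need the computed styles of a rendered page.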
The structural features comprise: the number of layout blocks, the number of text blocks, the ratio of the text block area to the area of the whole webpage, the ratio of the number of text characters to the text block area, the length-to-width ratio of the webpage, and the sum of the webpage's length and width.
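The six structural features reduce to simple arithmetic on block rectangles. The sketch below assumes blocks are given as `(width, height)` pairs in pixels and that "length" means the page height; both are assumptions, and `structural_features` is a hypothetical name.

```python
def structural_features(page_width, page_height, layout_blocks, text_blocks,
                        text_chars):
    """Compute the six structural features.

    layout_blocks and text_blocks are lists of (width, height) tuples;
    text_chars is the number of text characters in the webpage.
    """
    page_area = page_width * page_height
    text_area = sum(w * h for w, h in text_blocks)
    return {
        "layout_block_count": len(layout_blocks),
        "text_block_count": len(text_blocks),
        "text_area_ratio": text_area / page_area,
        # Guard against pages without any text block.
        "chars_per_text_area": text_chars / text_area if text_area else 0.0,
        # Assumption: length-to-width ratio is height divided by width.
        "aspect_ratio": page_height / page_width,
        "size_sum": page_width + page_height,
    }
```

For a 1000 x 2000 page with text blocks of 400 x 300 and 200 x 100 pixels and 700 text characters, the text area is 140000, so the area ratio is 0.07 and the character density is 0.005.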
The visual features comprise: the hue of the webpage, denoted Hue(Page); its brightness, denoted Brightness(Page); its colorfulness, denoted Colorfulness(Page); and the size of the file obtained after the webpage is converted into an image. To compute the visual features, the webpage is first converted into an image, which is represented in both the HSV and the RGB color space. The following formulas are then applied:
$$\mathrm{Hue}(\mathrm{Page}) = \frac{1}{N \cdot M} \sum_{i=1}^{N} \sum_{j=1}^{M} H(i,j)$$
$$\mathrm{Brightness}(\mathrm{Page}) = \frac{1}{N \cdot M} \sum_{i=1}^{N} \sum_{j=1}^{M} V(i,j)$$
$$\mathrm{Colorfulness}(\mathrm{Page}) = \alpha_{rgyb} + 0.3\,\beta_{rgyb}$$
Here i and j denote the position of a pixel in the horizontal and vertical directions, and N and M are the height and width of the image corresponding to the webpage. H(i, j), S(i, j) and V(i, j) are the H, S and V values in the HSV color space of the pixel at position (i, j). In the colorfulness formula, $\alpha_{rgyb}$ and $\beta_{rgyb}$ are the variance color factor and the mean color factor, respectively, computed as:
$$rg(i,j) = R(i,j) - G(i,j)$$
$$yb(i,j) = 0.5\,(R(i,j) + G(i,j)) - B(i,j)$$
$$\alpha_{rgyb} = [\sigma_{rg}]^2 + [\sigma_{yb}]^2$$
$$\beta_{rgyb} = [\mu_{rg}]^2 + [\mu_{yb}]^2$$
R(i, j), G(i, j) and B(i, j) are the R, G and B values in the RGB color space of the pixel at position (i, j). Over all pixels of the image, $\sigma_{rg}$ and $\mu_{rg}$ are the variance and the mean of rg, the difference between the R and G values; $\sigma_{yb}$ and $\mu_{yb}$ are the variance and the mean of yb, the difference between the average of the R and G components and the B component.
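The hue, brightness and colorfulness formulas can be sketched with the Python standard library alone. The sketch assumes the rendered page is available as rows of `(R, G, B)` tuples with 8-bit channels, scales H and V to the range 0 to 1 (the text does not fix a scale), and follows the formulas literally as they appear in the extracted text: σ is taken as the population variance and no square root is applied. The file size feature is left out because it depends on how the page is rendered and encoded. `visual_features` is a hypothetical name.

```python
import colorsys
from statistics import mean, pvariance


def visual_features(pixels):
    """Hue, brightness and colorfulness of an image.

    pixels is a list of rows, each row a list of (R, G, B) tuples (0..255).
    """
    flat = [px for row in pixels for px in row]

    # colorsys works on floats in 0..1 and returns (h, s, v) in 0..1.
    hsv = [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
           for r, g, b in flat]
    hue = mean([h for h, _, _ in hsv])
    brightness = mean([v for _, _, v in hsv])

    # Opponent color components on the raw 0..255 channel values.
    rg = [r - g for r, g, b in flat]
    yb = [0.5 * (r + g) - b for r, g, b in flat]

    # Literal reading of the formulas: sigma is the variance, mu the mean.
    alpha = pvariance(rg) ** 2 + pvariance(yb) ** 2
    beta = mean(rg) ** 2 + mean(yb) ** 2

    return {
        "hue": hue,
        "brightness": brightness,
        "colorfulness": alpha + 0.3 * beta,
    }
```

A uniform gray image has rg = yb = 0 everywhere, so its colorfulness is 0, while a uniform pure red image has zero variance but a large mean term.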
Step 304: Split the resulting set of feature vectors into a training set and a test set. Train the random forest classifier on the training set to obtain the classifier parameters, use the trained random forest model to classify the test set and output the classification results, and then predict new webpage samples.
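To make the training step concrete, here is a deliberately tiny random-forest-style ensemble in pure Python. It is a simplification and not the implementation described in the text: each tree is limited to a single split (a depth-1 tree), trained on a bootstrap sample with a random subset of features, and the ensemble predicts by majority vote. All function names and the toy feature values are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y, feature_ids):
    """Best single-threshold split over the given features (weighted Gini)."""
    best = None  # (score, feature, threshold, left_label, right_label)
    for f in feature_ids:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left), majority(right))
    if best is None:  # no split possible: always predict the majority class
        m = majority(y)
        return (feature_ids[0], float("inf"), m, m)
    return best[1:]


def fit_forest(X, y, n_trees=25, seed=0):
    """Bagging plus random feature subsets, one depth-1 tree per round."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    k = max(1, int(d ** 0.5))  # features considered per tree
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feats = rng.sample(range(d), k)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote: 1 = visually complex, 0 = visually simple."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)


def train_test_split(X, y, test_ratio=0.25, seed=0):
    """Shuffle the samples and split them into training and test parts."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_ratio))
    tr, te = idx[:cut], idx[cut:]
    return ([X[i] for i in tr], [y[i] for i in tr],
            [X[i] for i in te], [y[i] for i in te])
```

In practice a full-depth random forest from an established library would be the more usual choice; the sketch only shows how bootstrap sampling, random feature selection and voting fit together.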
The above is only an embodiment of the present invention, but the scope of protection of the invention is not limited to it. Any variation or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of the invention. The scope of protection of the invention is therefore defined by the claims.

Claims (2)

1. A method for evaluating the visual complexity of webpages, characterized in that the method comprises:
Step 1: Collect as many webpage samples as possible to build a training set. Each webpage is manually judged as visually complex or not. Samples whose combined manual label is "visually complex" form the positive sample set, samples whose combined manual label is "visually simple" form the negative sample set, and the two sets together constitute the training set;
Step 2: Obtain the source code of each webpage in the training set, and use a webpage segmentation algorithm to segment each webpage and extract its layout blocks and text blocks;
Step 3: Convert each webpage into an image, and extract three kinds of features from each webpage: source code features, structural features and visual features;
The source code features comprise: the number of text characters the webpage contains, the number of characters in the display text of its hyperlinks, the number of fonts it uses, the number of background colors, and the number of images;
The structural features comprise: the number of layout blocks, the number of text blocks, the ratio of the text block area to the area of the whole webpage, the ratio of the number of text characters to the text block area, the length-to-width ratio of the webpage, and the sum of the webpage's length and width;
The visual features comprise: the hue of the webpage, denoted Hue(Page); its brightness, denoted Brightness(Page); its colorfulness, denoted Colorfulness(Page); and the size of the file obtained after the webpage is converted into an image. To compute the visual features, the webpage is first converted into an image, which is represented in both the HSV and the RGB color space. The following formulas are then applied:
$$\mathrm{Hue}(\mathrm{Page}) = \frac{1}{N \cdot M} \sum_{i=1}^{N} \sum_{j=1}^{M} H(i,j)$$
$$\mathrm{Brightness}(\mathrm{Page}) = \frac{1}{N \cdot M} \sum_{i=1}^{N} \sum_{j=1}^{M} V(i,j)$$
$$\mathrm{Colorfulness}(\mathrm{Page}) = \alpha_{rgyb} + 0.3\,\beta_{rgyb}$$
Here i and j denote the position of a pixel in the horizontal and vertical directions, and N and M are the height and width of the image corresponding to the webpage. H(i, j), S(i, j) and V(i, j) are the H, S and V values in the HSV color space of the pixel at position (i, j). In the colorfulness formula, $\alpha_{rgyb}$ and $\beta_{rgyb}$ are the variance color factor and the mean color factor, respectively, computed as:
$$rg(i,j) = R(i,j) - G(i,j)$$
$$yb(i,j) = 0.5\,(R(i,j) + G(i,j)) - B(i,j)$$
$$\alpha_{rgyb} = [\sigma_{rg}]^2 + [\sigma_{yb}]^2$$
$$\beta_{rgyb} = [\mu_{rg}]^2 + [\mu_{yb}]^2$$
R(i, j), G(i, j) and B(i, j) are the R, G and B values in the RGB color space of the pixel at position (i, j). Over all pixels of the image, $\sigma_{rg}$ and $\mu_{rg}$ are the variance and the mean of rg, the difference between the R and G values; $\sigma_{yb}$ and $\mu_{yb}$ are the variance and the mean of yb, the difference between the average of the R and G components and the B component;
Step 4: Train a random forest classifier on the extracted features of each webpage to obtain the classifier parameters, then evaluate a new webpage sample and judge whether it is a webpage above the complexity threshold.
2. The method for evaluating the visual complexity of webpages according to claim 1, characterized in that the training set is built by asking multiple users to judge whether each webpage sample is visually complex, which yields several visual complexity ratings per sample, and then averaging those ratings. A sample whose mean rating is above the visual complexity threshold has the combined manual label "visually complex" and is classed as a positive sample; a sample whose mean rating is below the threshold has the combined manual label "visually simple" and is classed as a negative sample. All positive samples form the positive sample set, all negative samples form the negative sample set, and the two sets together constitute the training set.
CN 201010106759 2010-02-03 2010-02-03 Automatic evaluation method for webpage vision complexity Expired - Fee Related CN102141998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010106759 CN102141998B (en) 2010-02-03 2010-02-03 Automatic evaluation method for webpage vision complexity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010106759 CN102141998B (en) 2010-02-03 2010-02-03 Automatic evaluation method for webpage vision complexity

Publications (2)

Publication Number Publication Date
CN102141998A CN102141998A (en) 2011-08-03
CN102141998B true CN102141998B (en) 2013-02-27

Family

ID=44409522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010106759 Expired - Fee Related CN102141998B (en) 2010-02-03 2010-02-03 Automatic evaluation method for webpage vision complexity

Country Status (1)

Country Link
CN (1) CN102141998B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599931B (en) * 2016-12-23 2019-07-02 南京师范大学 A kind of broken ridge line correlating method based on random forest
CN107358209B (en) * 2017-07-17 2020-02-28 成都通甲优博科技有限责任公司 Training method and device of face detection model and face detection method and device
CN108171760A (en) * 2018-01-29 2018-06-15 河南大学 A kind of image forms complexity calculating method
CN108921184A (en) * 2018-04-18 2018-11-30 中国科学院信息工程研究所 A kind of general type of webpage determination method
CN109740435A (en) * 2018-11-30 2019-05-10 四川译讯信息科技有限公司 A kind of picture class file complexity determination method and platform
CN109740434A (en) * 2018-11-30 2019-05-10 四川译讯信息科技有限公司 A kind of document class file complexity determination method and platform

Also Published As

Publication number Publication date
CN102141998A (en) 2011-08-03

Similar Documents

Publication Publication Date Title
CN101777060B (en) Webpage classification method and system based on webpage visual characteristics
CN102141998B (en) Automatic evaluation method for webpage vision complexity
CN105516802B (en) The news video abstract extraction method of multiple features fusion
CN102156865A (en) Handwritten text line character segmentation method and identification method
CN102682120B (en) Method and device for acquiring essential article commented on network
CN112528997B (en) Tibetan-Chinese bilingual scene text detection method based on text center region amplification
CN101533517A (en) Structure feature based on Chinese painting and calligraphy seal image automatic extracting method
CN107301200A (en) A kind of article appraisal procedure and system analyzed based on Sentiment orientation
CN102880865A (en) Dynamic gesture recognition method based on complexion and morphological characteristics
CN108334493A (en) A kind of topic knowledge point extraction method based on neural network
CN103455823B (en) The English character recognition method that a kind of fuzzy based on classification and image is split
CN103440494A (en) Horrible image identification method and system based on visual significance analyses
CN101980210A (en) Marked word classifying and grading method and system
CN110765739A (en) Method for extracting table data and chapter structure from PDF document
CN101251896B (en) Object detecting system and method based on multiple classifiers
CN105808126A (en) Electronic teaching whiteboard operation method and device
EP2458334A3 (en) Road estimation device and method for estimating road
CN105930798A (en) Tongue image quick detection and segmentation method based on learning and oriented to handset application
CN106372639B (en) Block letter Uighur document cutting method based on morphology and integral projection
CN103985130A (en) Image significance analysis method for complex texture images
EP2733643A3 (en) System and method facilitating designing of classifier while recognizing characters in a video
CN104281850A (en) Character area identification method and device
CN105844679A (en) Method and device for complex braille dot pattern drafting and embedded character input
CN109447015A (en) A kind of method and device handling form Image center selection word
CA2971996C (en) Chinese character information recording method and chinese character stroke order determining diagram device for teaching

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130227

Termination date: 20220203
