CN103870486A - Webpage type confirming method and device - Google Patents

Publication number
CN103870486A
Authority
CN
China
Prior art keywords
web page
webpage
page characteristics
type
default
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210539055.0A
Other languages
Chinese (zh)
Inventor
张富强
杨巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shiji Guangsu Information Technology Co Ltd
Original Assignee
Shenzhen Shiji Guangsu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Shiji Guangsu Information Technology Co Ltd
Priority to CN201210539055.0A
Publication of CN103870486A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and device for determining the type of a webpage, and belongs to the field of communication technology. The method comprises: obtaining document information of the webpage, wherein the document information comprises web page address (URL, Uniform Resource Locator) information, web document content information and webpage visual information; extracting web page characteristic parameters from the document information; and determining the type of the webpage according to the extracted web page characteristic parameters. The device comprises an acquisition module, an extraction module and a determination module. Because the web page characteristic parameters are extracted from the URL information, the web document content information and the webpage visual information, the extraction range is wide and multiple characteristic parameters can be extracted, so that the type of the webpage can be determined effectively according to the extracted parameters.

Description

Method and apparatus for determining the type of a webpage
Technical field
The present invention relates to the field of communication technology, and in particular to a method and apparatus for determining the type of a webpage.
Background art
With the rapid development of the mobile Internet and of mobile terminals (such as mobile phones and tablet computers), the Internet can be accessed not only through PC (personal computer) terminals but also through mobile terminals. Therefore, besides web webpages intended for PC terminals, there are also wap (Wireless Application Protocol) webpages intended for mobile terminals. However, wap webpages intended for mobile terminals give a very poor experience on PC terminals; in particular, wap1.0 webpages cannot be displayed on PC terminals at all. Likewise, web webpages intended for PC terminals do not display well on mobile terminals. For a search engine, therefore, the first task when crawling a webpage is to determine its type, that is, to distinguish whether the current webpage is a wap webpage or a web webpage. Only then can web webpages that are unfriendly to mobile terminals be kept out of wap search results, and wap webpages that are unfriendly to PC terminals be kept out of web search results.
Existing methods for determining the type of a webpage mainly include determination based on the differences between the markup languages of wap webpages and web webpages, and determination based on web document content.
However, in the course of realizing the present invention, the inventors found that the prior art has at least the following problems:
The method based on differences in markup language can distinguish wap1.0 webpages, which use WML (Wireless Markup Language), from web webpages, which use HTML (HyperText Markup Language), because WML and HTML differ greatly. However, it cannot distinguish wap2.0 webpages, which use XHTML (eXtensible HyperText Markup Language), from web webpages that use HTML, because XHTML and HTML differ very little.
The method based on web document content relies on the declaration DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile". For webpages written according to the specification, this declaration indicates whether a page is a wap2.0 webpage or a web webpage. In practice, however, most webpages do not follow the specification, so the method cannot determine whether they are wap2.0 webpages or web webpages.
Summary of the invention
To solve the problems of the prior art, embodiments of the present invention provide a method and apparatus for determining the type of a webpage. The technical solution is as follows:
In one aspect, a method for determining the type of a webpage is provided, the method comprising:
obtaining document information of a webpage, wherein the document information of the webpage comprises web page address URL information, web document content information and webpage visual information;
extracting web page characteristic parameters from the document information of the webpage; and
determining the type of the webpage according to the extracted web page characteristic parameters.
In another aspect, an apparatus for determining the type of a webpage is provided, the apparatus comprising:
an acquisition module, configured to obtain document information of a webpage, wherein the document information of the webpage comprises web page address URL information, web document content information and webpage visual information;
an extraction module, configured to extract web page characteristic parameters from the document information of the webpage after the acquisition module obtains the document information; and
a determination module, configured to determine the type of the webpage according to the extracted web page characteristic parameters after the extraction module extracts them from the document information of the webpage.
The technical solution provided by the embodiments of the present invention brings the following beneficial effects:
Web page characteristic parameters are extracted from the obtained document information of the webpage (comprising web page address URL information, web document content information and webpage visual information). Because the extraction range is wide, multiple web page characteristic parameters can be extracted, so that the type of the webpage can be determined effectively according to the extracted parameters.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Apparently, the accompanying drawings in the following description show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the method for determining the type of a webpage provided by Embodiment 1 of the present invention;
Fig. 2 is a flowchart of the method for determining the type of a webpage provided by Embodiment 2 of the present invention;
Fig. 3 is a schematic structural diagram of the apparatus for determining the type of a webpage provided by Embodiment 3 of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Embodiment 1
This embodiment of the present invention provides a method for determining the type of a webpage. Referring to Fig. 1, the method comprises:
101: Obtain the document information of a webpage.
The document information of the webpage comprises web page address URL information, web document content information and webpage visual information.
102: Extract web page characteristic parameters from the document information of the webpage.
103: Determine the type of the webpage according to the extracted web page characteristic parameters.
Preferably, determining the type of the webpage according to the extracted web page characteristic parameters comprises:
calculating the web page characteristic score of the webpage according to a preset correspondence between web page characteristic parameters and web page characteristic scores; and
determining the type of the webpage according to the web page characteristic score of the webpage.
Preferably, determining the type of the webpage according to the web page characteristic score of the webpage comprises:
comparing the web page characteristic score of the webpage with a preset web page characteristic score threshold; and
if the web page characteristic score of the webpage is greater than the preset web page characteristic score threshold, determining that the type of the webpage is Wireless Application Protocol wap2.0.
Preferably, after the web page characteristic score of the webpage is compared with the preset web page characteristic score threshold, the method further comprises:
if the web page characteristic score of the webpage is less than or equal to the preset web page characteristic score threshold, determining that the type of the webpage is web.
Preferably, determining the type of the webpage according to the extracted web page characteristic parameters comprises:
inputting the extracted web page characteristic parameters into a web page classification model, wherein the web page classification model is obtained by machine learning on web page characteristic parameters extracted from multiple preset webpages; and
determining the type of the webpage by means of the web page classification model.
Preferably, the web page characteristic parameters in the web page address URL information comprise:
at least one of //wap., /wap/, //wap, //3g., .3g., /3g/, //m., /m/ and .mobi/.
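The URL check described above can be sketched as a simple substring test. This is a minimal illustration only: the marker strings come from the text, while the function name, the case-insensitive matching and the list return value are assumptions made for the example.

```python
# Marker strings listed in the text; everything else here is illustrative.
URL_MOBILE_MARKERS = (
    "//wap.", "/wap/", "//wap", "//3g.", ".3g.", "/3g/", "//m.", "/m/", ".mobi/",
)


def extract_url_features(url: str) -> list[str]:
    """Return the mobile-style markers that occur in the URL (case-insensitive)."""
    lowered = url.lower()
    return [marker for marker in URL_MOBILE_MARKERS if marker in lowered]
```

Note that several markers overlap (a URL containing "//wap." also contains "//wap"), so a scoring table built on these features may need to account for double counting.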
Preferably, the web page characteristic parameters in the web document content information comprise at least one of:
the string "xhtml mobile" contained in the doctype declaration; a meta tag containing MobileOptimized; an "<?xml" declaration; a meta tag containing viewport in which width=device-width or width=xxx is set; a time announcement; strings such as "simple version", "color version", "touch-screen version" or "dazzling-color version"; the file name of an externally linked CSS (cascading style sheet) file containing wap, phone, mob or 3g; the strings "mobile version", "3g version" or "wap version" contained in the title; a character encoding of ASCII, GB2312, GBK or BIG5; the number of externally linked css files being greater than a preset css file count threshold; the number of externally linked js files being greater than a preset js file count threshold; the ratio of the number of table, tbody (table body), tr (table row) and td (table data) tags to the number of all tags in the web document content being greater than a preset tag ratio threshold; an html tag whose width is greater than a preset width threshold; the number of img tags being greater than a preset img tag count threshold; JavaScript code; the string "dtd html" contained in the doctype declaration; RSS subscription information; events related to mouse operation; an iframe tag (creates an inline floating frame), a button tag (defines a button), a center tag (centers text and images), a frame tag, a frameset tag or an applet tag (places executable content on the page); a title length greater than a preset length threshold; a URL that does not begin with www while the number of links beginning with www in the web document content is greater than a preset link count threshold; Google ad code in which the width of the Google ad is greater than a preset width threshold; "add to favorites" code; code that sets the webpage as the home page; a meta tag containing MSThemeCompatible; a meta tag containing x-ua-compatible; a width property value in embedded or externally linked css greater than 320 pixels; and a web document content size greater than a preset byte count threshold.
Preferably, the web page characteristic parameters in the webpage visual information comprise at least one of:
the width of every tag node in the webpage being less than a first preset width threshold; a tag node whose width is greater than a second preset width threshold; a Lishu font or an italic font; the number of tag nodes whose float property value is right being greater than a preset tag node count threshold; a non-small-icon picture located in a preset restricted area, with the ratio of the width of the non-small-icon picture to the width of the whole webpage being less than a preset width ratio threshold; and a picture whose width is greater than a third preset width threshold.
In the method for determining the type of a webpage described in this embodiment of the present invention, web page characteristic parameters are extracted from the obtained document information of the webpage (comprising web page address URL information, web document content information and webpage visual information). Because the extraction range is wide, multiple web page characteristic parameters can be extracted, so that the type of the webpage can be determined effectively according to the extracted parameters. Further, calculating the web page characteristic score of the webpage and determining the type of the webpage according to that score can improve the accuracy of the determination. Further, the type of the webpage can be determined by a web page classification model, which is obtained in advance by machine learning training on web page characteristic parameters extracted from multiple preset webpages; this can also improve the accuracy of the determination.
Embodiment 2
This embodiment of the present invention provides a method for determining the type of a webpage. Referring to Fig. 2, the method comprises:
201: Obtain the URL (Uniform Resource Locator, i.e. web page address) information, HTTP (Hypertext Transfer Protocol) header information, web document content information and webpage visual information of a webpage.
The web document content information can be the source code of the web document, or the DOM (Document Object Model) tree of the webpage.
202: Attempt to extract wap1.0 web page characteristic parameters from the URL information, HTTP header information and web document content information of the webpage, and judge whether any can be extracted. If so, carry out 206; otherwise, carry out 203.
The wap1.0 web page characteristic parameters comprise:
the string "wml" in the HTTP header; or WML-specific tags in the web document content, such as wml, card (a card) and go (an action that jumps to a new card).
Specifically, if the string "wml" can be extracted from the HTTP header, the HTTP header information contains "wml"; if WML-specific tags such as wml, card or go can be extracted from the web document content, the web document content information contains those tags. If the HTTP header contains "wml", or the web document content contains any WML-specific tag such as wml, card or go, the webpage is a wap1.0 webpage.
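The wap1.0 check in step 202 can be sketched as follows. This is a minimal sketch under stated assumptions: the "wml" header string and the wml, card and go tags come from the text, while the function name, the dictionary representation of HTTP headers and the regular expression are hypothetical choices made for illustration.

```python
import re

# Tags the text names as WML-specific; the pattern itself is an assumption.
WML_TAG_PATTERN = re.compile(r"<\s*(wml|card|go)\b", re.IGNORECASE)


def is_wap1_page(http_headers: dict[str, str], document: str) -> bool:
    """Return True if a header value contains "wml" or the document uses a WML tag."""
    header_hit = any("wml" in value.lower() for value in http_headers.values())
    tag_hit = WML_TAG_PATTERN.search(document) is not None
    return header_hit or tag_hit
```

Because a single hit is treated as decisive, the function returns as soon as either source of evidence is present, which mirrors the either/or wording of the step.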
203: Attempt to extract wap2.0 web page characteristic parameters from the URL information, web document content information and webpage visual information of the webpage, and judge whether any can be extracted. If so, carry out 204; otherwise, carry out 208.
The wap2.0 web page characteristic parameters in the URL information comprise: strings in the URL such as "//wap.", "/wap/", "//wap", "//3g." (3g: 3rd-generation mobile communication technology), ".3g.", "/3g/", "//m." (m: mobile), "/m/" and ".mobi/" (mobi: mobile).
The wap2.0 web page characteristic parameters in the web document content information comprise: the string "xhtml mobile" contained in the doctype (document type) declaration; a meta (metadata) tag in the web document content containing MobileOptimized (the webpage sets a width optimized for mobile devices); an "<?xml" declaration contained in the web document content; a meta tag in the web document content containing viewport, in which the string "width=device-width" or "width=xxx" appears (where the value of xxx is less than or equal to 320 pixels); a time announcement at the end of the web document content; strings such as "simple version", "color version", "touch-screen version" and "dazzling-color version" contained in the web document content; the file name of an externally linked css (Cascading Style Sheet) file in the web document content containing "wap", "phone", "mob" or "3g"; strings such as "mobile version", "3g version" and "wap version" contained in the title of the web document content; the character encoding of the web document content being a non-UTF8 encoding such as ASCII (American Standard Code for Information Interchange), GB2312 (simplified Chinese; GB is the pinyin abbreviation of "national standard" and 2312 is the standard number), GBK (an extension of GB2312; K is the pinyin abbreviation of "extension") or BIG5 (traditional Chinese); the number of externally linked css files in the web document content being greater than a preset css file count threshold; the number of externally linked js (JavaScript) files in the web document content being greater than a preset js file count threshold; the ratio of the number of table, tbody (table body), tr (table row) and td (table data) tags in the web document content to the number of all tags in the web document being greater than a preset tag ratio threshold; an html tag in the web document content whose width is greater than a preset width threshold (for example, 320 pixels); the number of img (image) tags in the web document content being greater than a preset img tag count threshold; JavaScript code contained in the web document content; the string "dtd (document type definition) html" contained in the doctype declaration; a doctype declaration; RSS subscription information contained in the web document content; events related to mouse operation contained in the web document content; tags contained in the web document content that are not recommended for wap2.0 webpages, such as iframe (creates an inline floating frame), button (defines a button), center (centers text and images), frame, frameset and applet (places executable content on the page); the title length of the web document content being greater than a preset length threshold; the URL not beginning with www while the number of links beginning with www in the web document content is greater than a preset link count threshold; Google ad code contained in the web document content in which the width of the Google ad is greater than a preset width threshold; code with an effect similar to "add to favorites" contained in the web document content; code contained in the web document content that sets the webpage as the home page; a meta tag in the web document content containing "MSThemeCompatible" (Windows XP theme); a meta tag in the web document content containing "x-ua-compatible" (specifies the document mode for the webpage); attribute values of tags in the web document that are not enclosed in double quotation marks, meaning that either no quotation marks are used or the quotation marks are not paired (standard html syntax requires tag attribute values to be enclosed in double quotation marks); a style in embedded or externally linked css contained in the web document content whose width property value is greater than 320 pixels; and the size of the web document content being greater than a preset byte count threshold (for example, 500 KB).
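A few of the content characteristics listed above can be detected with plain pattern matching, as in the sketch below. It is illustrative only: it covers a small subset of the list, the feature labels (such as "doctype_xhtml_mobile") and the default css threshold of 3 are hypothetical, and a production extractor would more likely work on a parsed DOM tree than on regular expressions.

```python
import re


def extract_content_features(html: str, css_threshold: int = 3) -> set[str]:
    """Detect a handful of the listed content characteristics in raw HTML."""
    features: set[str] = set()
    lowered = html.lower()

    # "xhtml mobile" string inside the doctype declaration
    doctype = re.search(r"<!doctype[^>]*>", lowered)
    if doctype and "xhtml mobile" in doctype.group(0):
        features.add("doctype_xhtml_mobile")

    for meta in re.findall(r"<meta[^>]*>", lowered):
        if "mobileoptimized" in meta:
            features.add("meta_mobileoptimized")
        if "viewport" in meta:
            # width=device-width, or a numeric width of at most 320 pixels
            width = re.search(r"width\s*=\s*(device-width|\d+)", meta)
            if width and (width.group(1) == "device-width" or int(width.group(1)) <= 320):
                features.add("meta_viewport_narrow")

    # number of externally linked css files above the (assumed) threshold
    if len(re.findall(r"<link[^>]*\.css[^>]*>", lowered)) > css_threshold:
        features.add("many_external_css")

    # tags the text says are not recommended for wap2.0 webpages
    if re.search(r"<\s*(iframe|frame|frameset|applet|center|button)\b", lowered):
        features.add("discouraged_tag")

    return features
```

Returning a set of labels keeps the extractor independent of how the labels are later weighted or fed to a classifier.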
The wap2.0 web page characteristic parameters in the webpage visual information comprise: the width of every tag node in the webpage being less than a first preset width threshold (for example, 320 pixels); a tag node in the webpage whose width is greater than a second preset width threshold (for example, 320 pixels); fonts in the webpage that some mobile terminals do not support (for example, Lishu or italic fonts); the number of tag nodes in the webpage whose float property value is right being greater than a preset tag node count threshold; a non-small-icon picture in the webpage located in a preset restricted area (for example, the preset restricted area may comprise a left restricted area and a right restricted area; a non-small-icon picture in the left restricted area lies toward the left of the page, and one in the right restricted area lies toward the right), with the ratio of the width of the non-small-icon picture to the width of the whole webpage being less than a preset width ratio threshold (for example, 0.4); and a picture in the webpage whose width is greater than a third preset width threshold (for example, 320 pixels).
Specifically, the wap2.0 web page characteristic parameters in the webpage visual information can be obtained by building a DOM tree from the web document content information and parsing out the width, height, position and style information of each tag node.
It should be noted that each of the above thresholds can be set to a specific value according to the actual application, which is not specifically limited here.
204: Calculate the wap2.0 web page characteristic score according to the preset correspondence between web page characteristic parameters and web page characteristic scores.
Specifically, the score corresponding to a web page characteristic parameter can be set according to how likely a webpage having that parameter is to be a wap2.0 webpage. If a webpage having the parameter is likely to be a wap2.0 webpage, the corresponding score is positive (such a parameter can be called a positive web page characteristic parameter). If a webpage having the parameter is likely not to be a wap2.0 webpage, the corresponding score is negative (such a parameter can be called a negative web page characteristic parameter). The magnitude of the score is set according to the degree of likelihood or unlikelihood.
Among the wap2.0 web page characteristic parameters extracted from the URL information, web document content information and webpage visual information described above, strings in the URL such as "//wap.", "/wap/", "//wap", "//3g.", ".3g.", "/3g/", "//m.", "/m/" and ".mobi/" are positive web page characteristic parameters. In the web document content information, the parameters from the "xhtml mobile" string in the doctype declaration through the strings such as "mobile version", "3g version" and "wap version" in the title of the web document content are positive web page characteristic parameters, and the other parameters are negative web page characteristic parameters. In the webpage visual information, the parameter that the width of every tag node in the webpage is less than the first preset width threshold (for example, 320 pixels) is a positive web page characteristic parameter, and the others are negative web page characteristic parameters.
Specifically, each web page characteristic parameter and its corresponding score can be stored as a pair and looked up from storage when needed.
Unlike a wap1.0 web page characteristic parameter, a single wap2.0 web page characteristic parameter cannot by itself fully determine whether a webpage is a wap2.0 webpage or a web webpage. Therefore, all characteristic parameters are scored together to finally determine whether the webpage is a wap2.0 webpage or a web webpage.
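The scoring in step 204 amounts to a table lookup followed by a sum. The sketch below assumes a hypothetical table: the feature labels and every numeric score are invented for illustration, since the text only states that positive parameters receive positive scores, negative parameters receive negative scores, and magnitudes reflect the degree of likelihood.

```python
# Hypothetical correspondence between characteristic parameters and scores.
# Positive values suggest wap2.0; negative values suggest web.
FEATURE_SCORES = {
    "url_mobile_marker": 3,
    "doctype_xhtml_mobile": 4,
    "meta_viewport_narrow": 3,
    "many_external_css": -1,
    "discouraged_tag": -2,
    "wide_element": -3,
}


def wap2_feature_score(features, table=FEATURE_SCORES) -> int:
    """Sum the stored scores of all extracted parameters; unknown ones count as 0."""
    return sum(table.get(feature, 0) for feature in features)
```

Keeping the table as data rather than code matches the description of storing each parameter with its score and reading it back when needed.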
205: Compare the wap2.0 web page characteristic score with the preset web page characteristic score threshold. If the score is greater than the threshold, carry out 207; otherwise, carry out 208.
206: Determine that the type of the webpage is wap1.0, and end.
207: Determine that the type of the webpage is wap2.0, and end.
208: Determine that the type of the webpage is web, and end.
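The branching among steps 202 to 208 can be summarized in a few lines. The sketch below assumes the earlier steps have already produced their results; the function name and parameters are hypothetical, and the step numbers in the comments refer to the flow described above.

```python
def determine_page_type(has_wap1_feature: bool, wap2_features: set,
                        score: int, threshold: int) -> str:
    """Follow the branch order of steps 202 to 208 and return the page type."""
    if has_wap1_feature:      # 202 -> 206
        return "wap1.0"
    if not wap2_features:     # 203 -> 208: no wap2.0 parameter could be extracted
        return "web"
    if score > threshold:     # 205 -> 207
        return "wap2.0"
    return "web"              # 205 -> 208: score less than or equal to the threshold
```

A score exactly equal to the threshold yields "web", since step 205 requires the score to be strictly greater than the threshold for wap2.0.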
It should be noted that calculating the wap2.0 web page characteristic score is not the only way to determine whether the type of the webpage is wap2.0 or web. The following approach can also be used:
The extracted wap2.0 web page characteristic parameters are input into a web page classification model, wherein the web page classification model is obtained by machine learning on wap2.0 web page characteristic parameters extracted from multiple preset webpages.
The web page classification model then determines whether the type of the webpage is wap2.0 or web.
The machine learning method can be a support vector machine (SVM), naive Bayes or the like, which is not specifically limited here.
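As one possible reading of the classification-model alternative, the sketch below trains a small Bernoulli naive Bayes classifier on sets of extracted characteristic labels. The text names naive Bayes only as an option; the class name, the Laplace smoothing, the feature labels and the tiny training set are all assumptions made for illustration, not the model described in the source.

```python
import math
from collections import defaultdict


class BernoulliNaiveBayes:
    """Tiny naive Bayes over presence/absence of characteristic labels."""

    def fit(self, samples: list[set[str]], labels: list[str]) -> "BernoulliNaiveBayes":
        self.vocabulary = sorted(set().union(*samples))
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        for features, label in zip(samples, labels):
            self.class_counts[label] += 1
            for feature in features:
                self.feature_counts[label][feature] += 1
        return self

    def predict(self, features: set[str]) -> str:
        total = sum(self.class_counts.values())
        best_label, best_log_prob = None, -math.inf
        for label, count in self.class_counts.items():
            log_prob = math.log(count / total)  # class prior
            for feature in self.vocabulary:
                # Laplace-smoothed probability that the feature is present
                p = (self.feature_counts[label][feature] + 1) / (count + 2)
                log_prob += math.log(p if feature in features else 1 - p)
            if log_prob > best_log_prob:
                best_label, best_log_prob = label, log_prob
        return best_label


# Hypothetical training data: characteristic sets from pages of known type.
training_samples = [
    {"doctype_xhtml_mobile", "meta_viewport_narrow"},
    {"url_mobile_marker", "meta_viewport_narrow"},
    {"discouraged_tag", "many_external_css"},
    {"discouraged_tag", "wide_element"},
]
training_labels = ["wap2.0", "wap2.0", "web", "web"]
model = BernoulliNaiveBayes().fit(training_samples, training_labels)
```

With a model of this kind, the weights that the scoring approach sets by hand are instead estimated from the labelled pages.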
In the method for determining the type of a webpage described in this embodiment of the present invention, web page characteristic parameters are extracted from the obtained document information of the webpage (comprising web page address URL information, web document content information and webpage visual information). Because the extraction range is wide, multiple web page characteristic parameters can be extracted, so that the type of the webpage can be determined effectively according to the extracted parameters. Further, calculating the web page characteristic score of the webpage and determining the type of the webpage according to that score can improve the accuracy of the determination. Further, the type of the webpage can be determined by a web page classification model, which is obtained in advance by machine learning training on web page characteristic parameters extracted from multiple preset webpages; this can also improve the accuracy of the determination.
Embodiment 3
Referring to Fig. 3, this embodiment of the present invention provides an apparatus for determining the type of a webpage, the apparatus comprising:
an acquisition module 301, configured to obtain document information of a webpage, wherein the document information of the webpage comprises web page address URL information, web document content information and webpage visual information;
an extraction module 302, configured to extract web page characteristic parameters from the document information of the webpage after the acquisition module 301 obtains the document information; and
a determination module 303, configured to determine the type of the webpage according to the extracted web page characteristic parameters after the extraction module 302 extracts them from the document information of the webpage.
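The three modules form a simple pipeline in which each module consumes the output of the previous one. The sketch below models that structure with callables; the class names, field names and types are hypothetical, and the reference numerals in the comments refer to the modules described above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DocumentInfo:
    url: str       # web page address URL information
    content: str   # web document content information
    visual: dict   # webpage visual information (e.g. node widths)


@dataclass
class PageTypeApparatus:
    acquire: Callable[[str], DocumentInfo]   # acquisition module 301
    extract: Callable[[DocumentInfo], set]   # extraction module 302
    determine: Callable[[set], str]          # determination module 303

    def page_type(self, url: str) -> str:
        """Run the modules in order: acquire, then extract, then determine."""
        return self.determine(self.extract(self.acquire(url)))
```

Passing the modules in as callables lets the determination module be swapped between a score-based unit and a classification-model unit without changing the rest of the pipeline.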
Preferably, the determination module 303 comprises:
a computing unit, configured to calculate the web page characteristic score of the webpage according to a preset correspondence between web page characteristic parameters and web page characteristic scores after the extraction module 302 extracts the web page characteristic parameters from the document information of the webpage; and
a determining unit, configured to determine the type of the webpage according to the web page characteristic score of the webpage after the computing unit calculates the score.
Preferably, the determining unit comprises:
a comparison subunit, configured to compare the web page characteristic score of the webpage with a preset web page characteristic score threshold after the computing unit calculates the score; and
a first determination subunit, configured to determine that the type of the webpage is Wireless Application Protocol wap2.0 when the comparison result of the comparison subunit is that the web page characteristic score of the webpage is greater than the preset web page characteristic score threshold.
Preferably, the determining unit further comprises:
a second determination subunit, configured to determine that the type of the webpage is web when the comparison result of the comparison subunit is that the web page characteristic score of the webpage is less than or equal to the preset web page characteristic score threshold.
Preferably, the determination module 303 comprises:
a processing unit, configured to input the extracted web page characteristic parameters into a web page classification model after the extraction module 302 extracts them from the document information of the webpage, wherein the web page classification model is obtained by machine learning on web page characteristic parameters extracted from multiple preset webpages; and
a classification model determining unit, configured to determine the type of the webpage by means of the web page classification model after the processing unit inputs the extracted web page characteristic parameters into the model.
Preferably, the web page characteristic parameters in the web page address URL information comprise:
At least one of "//wap.", "/wap/", "//wap", "//3g.", ".3g.", "/3g/", "//m.", "/m/" and ".mobi/".
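The URL check can be sketched as a plain substring search over the listed markers. The function name and the decision to lowercase the URL before matching are assumptions; the marker strings themselves come from the list above.

```python
# Marker substrings taken from the list of URL characteristic parameters.
URL_MARKERS = ("//wap.", "/wap/", "//wap", "//3g.", ".3g.", "/3g/", "//m.", "/m/", ".mobi/")


def url_features(url):
    """Return the markers found in the URL, in the order they are listed."""
    lowered = url.lower()  # assumed: matching is case-insensitive
    return [marker for marker in URL_MARKERS if marker in lowered]
```

Because "//wap" is a prefix of "//wap.", a URL such as http://wap.example.com/ matches both markers; whether overlapping markers should each contribute to the score is left to the preset correspondence.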
Preferably, the web page characteristic parameters in the web document content information comprise at least one of the following:
The words "xhtml mobile" in the doctype declaration; a meta tag containing MobileOptimized; an <?xml declaration; a meta tag containing viewport, where the meta tag includes width=device-width or width=xxx; time-announcement text; the words "simple version", "color version", "touch-screen version" or "dazzling-color version"; an externally linked cascading style sheet (CSS) file name containing the words wap, phone, mob or 3g; the words "mobile phone version", "3g version" or "wap version" in the title; a character encoding of ASCII, GB2312, GBK or BIG5; a number of externally linked css files greater than a preset css file number threshold; a number of externally linked js files greater than a preset js file number threshold; a ratio, between the number of table, tbody (table body), tr (table row) and td (table data) tags and the number of all tags in the web document content, greater than a preset tag ratio threshold; an html tag whose width is greater than a preset width threshold; a number of img tags greater than a preset img tag number threshold; JavaScript code; the words "dtd html" in the doctype declaration; RSS subscription information; events related to mouse operations; an iframe tag (which creates an inline floating frame), a button tag, a center tag (which centers text and images), a frame tag, a frameset tag, or an applet tag (which places executable content on the page); a title length greater than a preset length threshold; a URL that does not begin with www while the number of links beginning with www in the web document content is greater than a preset link number threshold; Google ad code, where the width of the Google ad is greater than a preset width threshold; add-to-favorites code; code that sets the webpage as the homepage; a meta tag containing MSThemeCompatible; a meta tag containing x-ua-compatible; a width attribute value in embedded or externally linked css greater than 320 pixels; and a size of the web document content greater than a preset byte number threshold.
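A few of the content checks listed above can be sketched with regular expressions over the raw HTML. This is only an illustration covering a subset of the list: the feature names and the default thresholds max_css and max_js are hypothetical, and a production extractor would more likely use a real HTML parser.

```python
import re


def content_features(html, max_css=3, max_js=5):
    """Extract a small subset of the content characteristic parameters from raw HTML."""
    lowered = html.lower()
    features = set()

    # Words in the doctype declaration.
    doctype = re.search(r"<!doctype[^>]*>", lowered)
    if doctype and "xhtml mobile" in doctype.group(0):
        features.add("doctype_xhtml_mobile")
    if doctype and "dtd html" in doctype.group(0):
        features.add("doctype_dtd_html")

    # Meta tags that hint at mobile or desktop rendering.
    for meta in re.findall(r"<meta[^>]*>", lowered):
        if "mobileoptimized" in meta:
            features.add("meta_mobileoptimized")
        if "viewport" in meta and "width=" in meta:
            features.add("meta_viewport_width")
        if "x-ua-compatible" in meta:
            features.add("meta_x_ua_compatible")

    # Counts of externally linked css and js files against assumed thresholds.
    css_links = [tag for tag in re.findall(r"<link[^>]*>", lowered) if ".css" in tag]
    if len(css_links) > max_css:
        features.add("many_external_css")
    if len(re.findall(r"<script[^>]*\ssrc=", lowered)) > max_js:
        features.add("many_external_js")

    # Frame, applet and center tags from the list above.
    if re.search(r"<(iframe|frame|frameset|applet|center)\b", lowered):
        features.add("legacy_layout_tag")
    return features
```

The returned set of names can be fed directly into a score table or converted into a binary vector for a classification model.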
Preferably, the web page characteristic parameters in the webpage visual information comprise at least one of the following:
The width of every tag node in the webpage is less than a first preset width threshold; a tag node whose width is greater than a second preset width threshold; LiSu (clerical script) font or italic font; a number of tag nodes whose float attribute value is right that is greater than a preset tag node number threshold; a non-small-icon picture that is positioned within a preset restricted area and whose width, as a ratio of the width of the whole webpage, is less than a preset width ratio threshold; and a picture whose width is greater than a third preset width threshold.
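Some of the visual checks can be sketched over a list of already rendered nodes. The input shape (dictionaries with "width" and "float" keys), the feature names and the default thresholds are assumptions; obtaining the rendered widths requires a layout engine that is outside the scope of this sketch.

```python
def visual_features(nodes, narrow_width=320, wide_width=800, max_right_floats=2):
    """Derive a subset of the visual characteristic parameters from rendered nodes.

    Each node is a dict such as {"tag": "div", "width": 300, "float": "right"}.
    """
    features = set()
    widths = [node["width"] for node in nodes]

    # Every node narrower than the first (assumed) width threshold.
    if widths and all(width < narrow_width for width in widths):
        features.add("all_nodes_narrow")

    # At least one node wider than the second (assumed) width threshold.
    if any(width > wide_width for width in widths):
        features.add("has_wide_node")

    # Number of right-floated nodes above the assumed count threshold.
    right_floats = sum(1 for node in nodes if node.get("float") == "right")
    if right_floats > max_right_floats:
        features.add("many_right_floats")
    return features
```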
The device for determining the type of a webpage described in this embodiment of the present invention extracts web page characteristic parameters from the obtained document information of the webpage (comprising web page address URL information, web document content information and webpage visual information). Because the extraction covers a wide scope, multiple web page characteristic parameters can be extracted, so that the type of the webpage can be effectively determined from them. Further, calculating the web page characteristic score of the webpage and determining the type of the webpage from that score can improve the accuracy of the determination. Further, determining the type of the webpage through a web page classification model, which is obtained in advance by machine learning training on web page characteristic parameters extracted from a preset plurality of webpages, can also improve the accuracy of the determination.
It should be noted that when the device for determining the type of a webpage provided by the above embodiment determines the type of a webpage, the division into the above functional modules is only used as an illustration. In practical applications, the above functions can be assigned to different functional modules as required; that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above. In addition, the device for determining the type of a webpage provided by the above embodiment and the method embodiment for determining the type of a webpage belong to the same concept; for the specific implementation process of the device, refer to the method embodiment, which is not repeated here.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the merits of the embodiments.
A person of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments can be implemented by hardware, or by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, which may be a read-only memory (ROM), a magnetic disk, an optical disc, or the like.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (12)

1. A method for determining the type of a webpage, characterized in that the method comprises:
Obtaining document information of the webpage; wherein the document information of the webpage comprises web page address URL information, web document content information and webpage visual information;
Extracting web page characteristic parameters from the document information of the webpage;
Determining the type of the webpage according to the extracted web page characteristic parameters.
2. The method according to claim 1, characterized in that determining the type of the webpage according to the extracted web page characteristic parameters comprises:
Calculating a web page characteristic score of the webpage according to a preset correspondence between web page characteristic parameters and web page characteristic scores;
Determining the type of the webpage according to the web page characteristic score of the webpage.
3. The method according to claim 2, characterized in that determining the type of the webpage according to the web page characteristic score of the webpage comprises:
Comparing the web page characteristic score of the webpage with a preset web page characteristic score threshold;
If the web page characteristic score of the webpage is greater than the preset web page characteristic score threshold, determining that the type of the webpage is Wireless Application Protocol (WAP) 2.0 (wap2.0).
4. The method according to claim 3, characterized in that, after comparing the web page characteristic score of the webpage with the preset web page characteristic score threshold, the method further comprises:
If the web page characteristic score of the webpage is less than or equal to the preset web page characteristic score threshold, determining that the type of the webpage is web.
5. The method according to claim 1, characterized in that determining the type of the webpage according to the extracted web page characteristic parameters comprises:
Inputting the extracted web page characteristic parameters into a web page classification model; wherein the web page classification model is obtained by performing machine learning on web page characteristic parameters extracted from a preset plurality of webpages;
Determining the type of the webpage through the web page classification model.
6. The method according to any one of claims 1 to 5, characterized in that the web page characteristic parameters in the web page address URL information comprise:
At least one of "//wap.", "/wap/", "//wap", "//3g.", ".3g.", "/3g/", "//m.", "/m/" and ".mobi/".
7. A device for determining the type of a webpage, characterized in that the device comprises:
An acquisition module, configured to obtain document information of the webpage; wherein the document information of the webpage comprises web page address URL information, web document content information and webpage visual information;
An extraction module, configured to extract web page characteristic parameters from the document information of the webpage after the acquisition module obtains the document information of the webpage;
A determination module, configured to determine the type of the webpage according to the extracted web page characteristic parameters after the extraction module extracts the web page characteristic parameters from the document information of the webpage.
8. The device according to claim 7, characterized in that the determination module comprises:
A computing unit, configured to calculate a web page characteristic score of the webpage, according to a preset correspondence between web page characteristic parameters and web page characteristic scores, after the extraction module extracts the web page characteristic parameters from the document information of the webpage;
A determining unit, configured to determine the type of the webpage according to the web page characteristic score of the webpage after the computing unit calculates that score.
9. The device according to claim 8, characterized in that the determining unit comprises:
A comparison subunit, configured to compare the web page characteristic score of the webpage with a preset web page characteristic score threshold after the computing unit calculates the web page characteristic score of the webpage;
A first determining subunit, configured to determine that the type of the webpage is Wireless Application Protocol (WAP) 2.0 (wap2.0) when the comparison result of the comparison subunit is that the web page characteristic score of the webpage is greater than the preset web page characteristic score threshold.
10. The device according to claim 9, characterized in that the determining unit further comprises:
A second determining subunit, configured to determine that the type of the webpage is web when the comparison result of the comparison subunit is that the web page characteristic score of the webpage is less than or equal to the preset web page characteristic score threshold.
11. The device according to claim 7, characterized in that the determination module comprises:
A processing unit, configured to input the extracted web page characteristic parameters into a web page classification model after the extraction module extracts the web page characteristic parameters from the document information of the webpage; wherein the web page classification model is obtained by performing machine learning on web page characteristic parameters extracted from a preset plurality of webpages;
A classification model determining unit, configured to determine the type of the webpage through the web page classification model after the processing unit inputs the extracted web page characteristic parameters into the web page classification model.
12. The device according to any one of claims 7 to 11, characterized in that the web page characteristic parameters in the web page address URL information comprise:
At least one of "//wap.", "/wap/", "//wap", "//3g.", ".3g.", "/3g/", "//m.", "/m/" and ".mobi/".

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210539055.0A CN103870486A (en) 2012-12-13 2012-12-13 Webpage type confirming method and device

Publications (1)

Publication Number Publication Date
CN103870486A true CN103870486A (en) 2014-06-18

Family

ID=50909029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210539055.0A Pending CN103870486A (en) 2012-12-13 2012-12-13 Webpage type confirming method and device

Country Status (1)

Country Link
CN (1) CN103870486A (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103383695A (en) * 2013-06-24 2013-11-06 百度在线网络技术(北京)有限公司 Method and equipment for identifying WAP web page
CN104090931A (en) * 2014-06-25 2014-10-08 华南理工大学 Information prediction and acquisition method based on webpage link parameter analysis
WO2015196740A1 (en) * 2014-06-25 2015-12-30 华南理工大学 Information forecast and acquisition method based on webpage link parameter analysis
CN104392009A (en) * 2014-12-19 2015-03-04 北京奇虎科技有限公司 Method and device for acquiring mobile site link address
CN105138698A (en) * 2015-09-25 2015-12-09 百度在线网络技术(北京)有限公司 Dynamic layout method and device for webpages
CN106294881A (en) * 2016-08-30 2017-01-04 五八同城信息技术有限公司 information identifying method and device
CN108108366A (en) * 2016-11-24 2018-06-01 腾讯科技(深圳)有限公司 A kind of webpage classification recognition methods and device
CN107741942B (en) * 2016-12-09 2020-06-02 腾讯科技(深圳)有限公司 Webpage content extraction method and device
CN107741942A (en) * 2016-12-09 2018-02-27 腾讯科技(深圳)有限公司 A kind of webpage content extracting method and device
US11074306B2 (en) 2016-12-09 2021-07-27 Tencent Technology (Shenzhen) Company Limited Web content extraction method, device, storage medium
CN108256104A (en) * 2018-02-05 2018-07-06 恒安嘉新(北京)科技股份公司 Internet site compressive classification method based on multidimensional characteristic
CN108256104B (en) * 2018-02-05 2020-05-26 恒安嘉新(北京)科技股份公司 Comprehensive classification method of internet websites based on multidimensional characteristics
CN110287409A (en) * 2019-06-05 2019-09-27 新华三信息安全技术有限公司 A kind of webpage type identification method and device
CN110287409B (en) * 2019-06-05 2022-07-22 新华三信息安全技术有限公司 Webpage type identification method and device
CN111639250A (en) * 2020-06-05 2020-09-08 深圳市小满科技有限公司 Enterprise description information acquisition method and device, electronic equipment and storage medium
CN111639250B (en) * 2020-06-05 2023-05-16 深圳市小满科技有限公司 Enterprise description information acquisition method and device, electronic equipment and storage medium
CN112084410A (en) * 2020-09-10 2020-12-15 北京百度网讯科技有限公司 Document type recommendation method and device, electronic equipment and readable storage medium
CN112084410B (en) * 2020-09-10 2023-07-25 北京百度网讯科技有限公司 Document type recommendation method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140618