CN103870486A - Webpage type confirming method and device - Google Patents
- Publication number
- CN103870486A, CN201210539055.0A
- Authority
- CN
- China
- Prior art keywords
- web page
- webpage
- page characteristics
- type
- default
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a webpage type confirming method and device and belongs to the technological field of communication. The webpage type confirming method comprises obtaining document information of the webpage, wherein the document information of the webpage comprises webpage address URL (Uniform Resource Locator) information, webpage document content information and webpage visual information; extracting webpage characteristic parameters from the webpage document information; confirming the webpage type according to the extracted webpage characteristic parameters. The webpage type confirming device comprises an obtaining module and an extracting module. According to the webpage type confirming method and device, the webpage characteristic parameters are extracted from the webpage document information (comprising the webpage address URL information, the webpage document content information and the webpage visual information), the extracting range of the webpage characteristic parameters is wide, a plurality of webpage characteristic parameters can be extracted, so that the webpage type can be confirmed according to the extracted webpage characteristic parameters.
Description
Technical field
The present invention relates to the field of communication technology, and in particular to a method and device for determining the type of a webpage.
Background
With the rapid development of the mobile Internet and of mobile terminals (such as mobile phones and tablet computers), the Internet can be accessed not only through PC (personal computer) terminals but also through mobile terminals. Accordingly, besides web webpages intended for PC terminals, there are also WAP (Wireless Application Protocol) webpages intended for mobile terminals. Wap webpages designed for mobile terminals give a poor experience on PC terminals, and wap1.0 webpages in particular cannot be displayed on a PC terminal at all. Conversely, web webpages designed for PC terminals do not display well on mobile terminals. For a search engine, therefore, a primary task when crawling webpages is to determine the type of each webpage, that is, to distinguish whether the current webpage is a wap webpage or a web webpage. Only then can web webpages that are unfriendly to mobile terminals be kept out of wap search results, and wap webpages that are unfriendly to PC terminals be kept out of web search results.
Existing methods for determining the type of a webpage mainly include determining it from the difference between the markup languages of wap webpages and web webpages, and determining it from the web document content.
However, in the course of implementing the present invention, the inventor found that the prior art has at least the following problems:
The method based on the difference between markup languages can tell apart wap1.0 webpages, which use WML (Wireless Markup Language), from web webpages, which use HTML (HyperText Markup Language), because WML and HTML differ greatly. It cannot, however, tell apart wap2.0 webpages, which use XHTML (eXtensible HyperText Markup Language), from web webpages that use HTML, because XHTML and HTML differ very little.
The method based on the web document content relies on a declaration such as DOCTYPE html PUBLIC "//WAPFORUM//DTD XHTML Mobile". For a webpage written according to the specification, this declaration shows whether the page is a wap2.0 webpage or a web webpage. In practice, however, most webpages do not follow the specification, so the declaration cannot be relied on to distinguish wap2.0 webpages from web webpages.
Summary of the invention
To solve the problems of the prior art, embodiments of the present invention provide a method and device for determining the type of a webpage. The technical solution is as follows:
In one aspect, a method for determining the type of a webpage is provided, the method comprising:
Obtaining document information of the webpage, wherein the document information of the webpage comprises webpage address URL information, web document content information and webpage visual information;
Extracting webpage characteristic parameters from the document information of the webpage;
Determining the type of the webpage according to the extracted webpage characteristic parameters.
In another aspect, a device for determining the type of a webpage is provided, the device comprising:
An acquisition module, configured to obtain document information of the webpage, wherein the document information of the webpage comprises webpage address URL information, web document content information and webpage visual information;
An extraction module, configured to extract webpage characteristic parameters from the document information of the webpage after the acquisition module obtains the document information;
A determination module, configured to determine the type of the webpage according to the extracted webpage characteristic parameters after the extraction module extracts them from the document information of the webpage.
The technical solution provided by the embodiments of the present invention has the following beneficial effects:
Webpage characteristic parameters are extracted from the obtained document information of the webpage (comprising webpage address URL information, web document content information and webpage visual information). Because the extraction range is wide and multiple webpage characteristic parameters can be extracted, the type of the webpage can be determined effectively from the extracted parameters.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the method for determining the type of a webpage provided by Embodiment 1 of the present invention;
Fig. 2 is a flowchart of the method for determining the type of a webpage provided by Embodiment 2 of the present invention;
Fig. 3 is a schematic structural diagram of the device for determining the type of a webpage provided by Embodiment 3 of the present invention.
Detailed description
To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.
Embodiment 1
This embodiment of the present invention provides a method for determining the type of a webpage. Referring to Fig. 1, the method comprises:
101: Obtain the document information of the webpage.
The document information of the webpage comprises webpage address URL information, web document content information and webpage visual information.
102: Extract webpage characteristic parameters from the document information of the webpage.
103: Determine the type of the webpage according to the extracted webpage characteristic parameters.
Preferably, determining the type of the webpage according to the extracted webpage characteristic parameters comprises:
Calculating the webpage characteristic score of the webpage according to a preset correspondence between webpage characteristic parameters and scores;
Determining the type of the webpage according to the webpage characteristic score of the webpage.
Preferably, determining the type of the webpage according to the webpage characteristic score of the webpage comprises:
Comparing the webpage characteristic score of the webpage with a preset webpage characteristic score threshold;
If the webpage characteristic score of the webpage is greater than the preset webpage characteristic score threshold, determining that the type of the webpage is wap2.0 (Wireless Application Protocol 2.0).
Preferably, after comparing the webpage characteristic score of the webpage with the preset webpage characteristic score threshold, the method further comprises:
If the webpage characteristic score of the webpage is less than or equal to the preset webpage characteristic score threshold, determining that the type of the webpage is web.
Preferably, determining the type of the webpage according to the extracted webpage characteristic parameters comprises:
Inputting the extracted webpage characteristic parameters into a webpage classification model, wherein the webpage classification model is obtained by machine learning on webpage characteristic parameters extracted from a preset plurality of webpages;
Determining the type of the webpage through the webpage classification model.
Preferably, the webpage characteristic parameters in the webpage address URL information comprise:
At least one of //wap., /wap/, //wap, //3g., .3g., /3g/, //m., /m/ and .mobi/.
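The URL check described above can be sketched as a plain substring search. This is only an illustration under stated assumptions: the function name `url_features`, the constant `URL_MARKERS` and the lower-casing of the URL are choices made for this example and are not taken from the patent text.

```python
# URL markers listed in the text; the tuple name is an assumption of this sketch.
URL_MARKERS = ("//wap.", "/wap/", "//wap", "//3g.", ".3g.", "/3g/", "//m.", "/m/", ".mobi/")


def url_features(url):
    """Return every listed marker that occurs in the URL (case-insensitive)."""
    lowered = url.lower()
    return [marker for marker in URL_MARKERS if marker in lowered]


mobile_hits = url_features("http://wap.example.com/news")
desktop_hits = url_features("http://www.example.com/index.html")
```

Because "//wap" is a prefix of "//wap.", a single URL can match more than one marker; how overlapping matches are weighted is left to the scoring step.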
Preferably, the webpage characteristic parameters in the web document content information comprise:
At least one of the following: the string "xhtml mobile" in the doctype declaration; a meta tag containing MobileOptimized; a "<?xml" declaration; a meta tag containing viewport in which width=device-width or width=xxx appears; a current-time display; the string "simple version", "color version", "touch-screen version" or "dazzling-color version"; an external CSS (Cascading Style Sheets) file name containing the string wap, phone, mob or 3g; the string "mobile phone version", "3g version" or "wap version" in the title; a character encoding of ASCII, GB2312, GBK or BIG5; a number of external css files greater than a preset css file number threshold; a number of external js files greater than a preset js file number threshold; a ratio of the number of table, tbody (table body), tr (table row) and td (table data) tags to the number of all tags in the web document content greater than a preset tag ratio threshold; an html tag whose width is greater than a preset width threshold; a number of img tags greater than a preset img tag number threshold; JavaScript code; the string "dtd html" in the doctype declaration; RSS subscription information; an event related to mouse operation; an iframe (inline frame), button, center (centers text and images), frame, frameset (frame set) or applet (embeds executable content in the page) tag; a title length greater than a preset length threshold; a URL that does not begin with www while the number of links beginning with www in the web document content is greater than a preset link number threshold; google ad code in which the width of the google ad is greater than a preset width threshold; "add to favorites" code; code that sets the webpage as the homepage; a meta tag containing MSThemeCompatible; a meta tag containing x-ua-compatible; a width property value in inline or external css greater than 320 pixels; and a size of the web document content greater than a preset byte number threshold.
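A few of the content checks listed above can be approximated with regular expressions over the page source. The sketch below covers only a small subset of the list; the function name `content_features` and the returned feature labels are hypothetical, and a real extractor would more likely work on a parsed DOM than on raw text.

```python
import re


def content_features(html):
    """Detect a small subset of the listed content characteristics in page source."""
    text = html.lower()
    found = set()

    # The doctype declaration is taken to be the first <!doctype ...> element.
    doctype = re.search(r"<!doctype[^>]*>", text)
    if doctype and "xhtml mobile" in doctype.group(0):
        found.add("doctype_xhtml_mobile")
    if doctype and "dtd html" in doctype.group(0):
        found.add("doctype_dtd_html")

    metas = re.findall(r"<meta[^>]*>", text)
    if any("mobileoptimized" in meta for meta in metas):
        found.add("meta_mobileoptimized")
    if any("viewport" in meta and "width=device-width" in meta for meta in metas):
        found.add("meta_viewport_device_width")

    if "<?xml" in text:
        found.add("xml_declaration")

    # Tags that the text lists as not recommended for wap2.0 pages.
    if re.search(r"<(iframe|frame|frameset|applet|center|button)\b", text):
        found.add("discouraged_tag")
    return found


mobile_page = (
    '<?xml version="1.0"?>'
    '<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN" '
    '"http://www.wapforum.org/DTD/xhtml-mobile10.dtd">'
    '<html><head><meta name="viewport" content="width=device-width"/></head>'
    "<body></body></html>"
)
desktop_page = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">'
    '<html><body><iframe src="a"></iframe></body></html>'
)
```

The two sample pages are made up for this example; threshold-based characteristics (file counts, tag ratios, document size) are omitted because the text leaves their values open.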
Preferably, the webpage characteristic parameters in the webpage visual information comprise:
At least one of the following: the width of every tag node in the webpage being less than a first preset width threshold; a tag node whose width is greater than a second preset width threshold; the Lishu font or an italic font; a number of tag nodes whose float property value is right greater than a preset tag node number threshold; a non-small-icon picture located in a preset restricted area, with the ratio of its width to the width of the whole webpage less than a preset width ratio threshold; and a picture whose width is greater than a third preset width threshold.
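Assuming the rendered width and float value of each tag node are already available, three of the visual checks above might look like the sketch below. The `NodeBox` record, the feature labels and the `float_right_limit` default of 3 are assumptions of this example; the 320-pixel width is the example value the description gives for the width thresholds.

```python
from dataclasses import dataclass


@dataclass
class NodeBox:
    """Layout facts for one tag node (hypothetical record for this sketch)."""
    tag: str
    width: int              # rendered width in pixels
    css_float: str = "none"


def visual_features(nodes, max_width=320, float_right_limit=3):
    """Detect three of the listed visual characteristics from node layout data."""
    found = set()
    if nodes and all(node.width < max_width for node in nodes):
        found.add("all_nodes_narrow")
    if any(node.width > max_width for node in nodes):
        found.add("has_wide_node")
    float_right = sum(1 for node in nodes if node.css_float == "right")
    if float_right > float_right_limit:
        found.add("many_float_right")
    return found


narrow_layout = [NodeBox("div", 300), NodeBox("p", 280)]
wide_layout = [NodeBox("div", 960), NodeBox("p", 200)]
```

The font and picture-position checks are left out because they need style and geometry data that this minimal record does not carry.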
In the method for determining the type of a webpage described in this embodiment, webpage characteristic parameters are extracted from the obtained document information of the webpage (comprising webpage address URL information, web document content information and webpage visual information). Because the extraction range is wide and multiple webpage characteristic parameters can be extracted, the type of the webpage can be determined effectively from the extracted parameters. Further, calculating the webpage characteristic score of the webpage and determining the type from that score can improve the accuracy of the determination. Further, determining the type through a webpage classification model, which is obtained in advance by machine learning on webpage characteristic parameters extracted from a preset plurality of webpages, can also improve the accuracy of the determination.
Embodiment 2
This embodiment of the present invention provides a method for determining the type of a webpage. Referring to Fig. 2, the method comprises:
201: Obtain the URL (Uniform Resource Locator, i.e. the webpage address) information, HTTP (HyperText Transfer Protocol) header information, web document content information and webpage visual information of the webpage.
The web document content information may be the source code of the web document or the DOM (Document Object Model) tree of the webpage.
202: Attempt to extract wap1.0 webpage characteristic parameters from the URL information, HTTP header information and web document content information of the webpage, and judge whether any can be extracted. If so, go to 206; otherwise, go to 203.
The wap1.0 webpage characteristic parameters comprise:
The string "wml" in the HTTP header; or tags specific to WML in the web document content, such as wml, card (a card) and go (an action that jumps to a new card).
Specifically, if the string "wml" can be extracted from the HTTP header, the HTTP header information contains "wml"; if WML-specific tags such as wml, card or go can be extracted from the web document content, the web document content information contains those tags. If the HTTP header contains the string "wml", or the web document content contains any WML-specific tag such as wml, card or go, the webpage is a wap1.0 webpage.
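The wap1.0 test above reduces to two checks, which the following sketch expresses directly. The function name `is_wap1`, the dictionary representation of the HTTP headers and the regular expression are assumptions made for illustration; `text/vnd.wap.wml` is the standard WML media type and is used here only as sample input.

```python
import re

# WML-specific tags named in the text.
WML_TAGS = ("wml", "card", "go")


def is_wap1(http_headers, html):
    """Return True if the headers mention "wml" or the source has a WML-specific tag."""
    if any("wml" in str(value).lower() for value in http_headers.values()):
        return True
    # \b keeps "<go" from matching longer tag names that merely start with "go".
    pattern = r"<(?:%s)\b" % "|".join(WML_TAGS)
    return re.search(pattern, html, re.IGNORECASE) is not None


by_header = is_wap1({"Content-Type": "text/vnd.wap.wml"}, "")
by_tag = is_wap1({"Content-Type": "text/html"}, "<wml><card id='a'></card></wml>")
plain_html = is_wap1({"Content-Type": "text/html"}, "<html><body>go</body></html>")
```

A single hit is enough here, which matches the statement that either condition alone identifies a wap1.0 webpage.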
203: Attempt to extract wap2.0 webpage characteristic parameters from the URL information, web document content information and webpage visual information of the webpage, and judge whether any can be extracted. If so, go to 204; otherwise, go to 208.
The wap2.0 webpage characteristic parameters in the URL information comprise strings in the URL such as "//wap.", "/wap/", "//wap", "//3g." (3g: third-generation mobile communication technology), ".3g.", "/3g/", "//m." (m: mobile), "/m/" and ".mobi/" (mobi: mobile);
The wap2.0 webpage characteristic parameters in the web document content information comprise:
The string "xhtml mobile" in the doctype (document type) declaration;
A meta (metadata) tag in the web document content containing MobileOptimized (a width optimization setting of the webpage for mobile devices);
A "<?xml" declaration in the web document content;
A meta tag in the web document content containing viewport, in which the string "width=device-width" or "width=xxx" appears (where the value of xxx is less than or equal to 320 pixels);
A current-time display at the end of the web document content;
Strings such as "simple version", "color version", "touch-screen version" or "dazzling-color version" in the web document content;
An external css (Cascading Style Sheets) file name in the web document content containing the string "wap", "phone", "mob" or "3g";
Strings such as "mobile phone version", "3g version" or "wap version" in the title of the web document content;
A character encoding of the web document content that is a non-UTF8 encoding such as ASCII (American Standard Code for Information Interchange), GB2312 (a simplified Chinese character set; GB is the pinyin abbreviation of "national standard" and 2312 is the serial number of the standard), GBK (an extension of GB2312; K is the pinyin abbreviation of "extension") or BIG5 (traditional Chinese);
A number of external css files of the web document content greater than a preset css file number threshold;
A number of external js (JavaScript) files of the web document content greater than a preset js file number threshold;
A ratio of the number of table, tbody (table body), tr (table row) and td (table data) tags in the web document content to the number of all tags in the web document greater than a preset tag ratio threshold;
An html tag in the web document content whose width is greater than a preset width threshold (such as 320 pixels);
A number of img (image) tags in the web document content greater than a preset img tag number threshold;
JavaScript code in the web document content;
The string "dtd (document type definition) html" in the doctype declaration;
A doctype declaration;
RSS (content syndication) subscription information in the web document content;
An event related to mouse operation in the web document content;
A tag in the web document content whose use is not recommended for wap2.0 webpages, such as iframe (creates an inline frame), button (defines a button), center (centers text and images), frame (a frame), frameset (a frame set) or applet (embeds executable content in the page);
A title length of the web document content greater than a preset length threshold;
A URL that does not begin with www, while the number of links beginning with www in the web document content is greater than a preset link number threshold;
google ad (advertisement) code in the web document content in which the width of the google ad is greater than a preset width threshold;
Code in the web document content for "add to favorites" or similar effects;
Code in the web document content that sets the webpage as the homepage;
A meta tag in the web document content containing "MSThemeCompatible" (XP theme);
A meta tag in the web document content containing "x-ua-compatible" (specifies the document mode for the webpage);
A tag attribute value in the web document that is not enclosed in double quotes, meaning that the quotes are missing or do not appear in pairs (standard html syntax requires tag attribute values to be enclosed in double quotes);
A style in inline or external css in the web document content whose width property value is greater than 320 pixels;
A size of the web document content greater than a preset byte number threshold (such as 500KB).
The wap2.0 webpage characteristic parameters in the webpage visual information comprise: the width of every tag node in the webpage being less than a first preset width threshold (such as 320 pixels); a tag node in the webpage whose width is greater than a second preset width threshold (such as 320 pixels); a font in the webpage that some mobile terminals do not support (such as Lishu or italic); a number of tag nodes in the webpage whose float property value is right greater than a preset tag node number threshold; a non-small-icon picture in the webpage located in a preset restricted area, with the ratio of its width to the width of the whole webpage less than a preset width ratio threshold (such as <0.4), where the preset restricted area may, for example, comprise a left restricted area and a right restricted area, a picture in the left restricted area being positioned toward the left and a picture in the right restricted area being positioned toward the right; and a picture in the webpage whose width is greater than a third preset width threshold (such as 320 pixels).
Specifically, the wap2.0 webpage characteristic parameters in the webpage visual information can be obtained by building a DOM tree from the web document content information and parsing out the width, height, position and style information of each tag node.
It should be noted that each of the above thresholds can be set to a specific value according to the actual application, which is not specifically limited here.
204: Calculate the wap2.0 webpage characteristic score according to the preset correspondence between webpage characteristic parameters and scores.
Specifically, the score corresponding to a webpage characteristic parameter can be set according to how likely a webpage that has this parameter is to be a wap2.0 webpage. If a webpage that has the parameter is likely to be a wap2.0 webpage, the corresponding score is positive (such a parameter can be called a positive webpage characteristic parameter). If a webpage that has the parameter is likely not to be a wap2.0 webpage, the corresponding score is negative (such a parameter can be called a negative webpage characteristic parameter). The magnitude of the score is set according to how strong that likelihood is.
Among the wap2.0 webpage characteristic parameters extracted from the URL information, web document content information and webpage visual information above, the strings "//wap.", "/wap/", "//wap", "//3g.", ".3g.", "/3g/", "//m.", "/m/" and ".mobi/" in the URL are positive webpage characteristic parameters. In the web document content information, the parameters from the string "xhtml mobile" in the doctype declaration through the strings such as "mobile phone version", "3g version" or "wap version" in the title are positive webpage characteristic parameters, and the other parameters are negative webpage characteristic parameters. In the webpage visual information, the width of every tag node in the webpage being less than the first preset width threshold (such as 320 pixels) is a positive webpage characteristic parameter, and the others are negative webpage characteristic parameters.
Specifically, each webpage characteristic parameter can be stored together with its corresponding score and looked up from the store when needed.
Unlike a wap1.0 webpage characteristic parameter, a single wap2.0 webpage characteristic parameter is not enough to determine fully whether a webpage is a wap2.0 webpage or a web webpage. All characteristic parameters are therefore scored together to make the final determination.
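The stored correspondence between parameters and scores, and the combined scoring of step 204, can be pictured as a lookup table and a sum. The sketch below is illustrative only: the feature labels and every score value in `SCORE_TABLE` are invented for this example, since the text states only that positive parameters carry positive scores and negative parameters carry negative ones.

```python
# Hypothetical score table: positive entries suggest wap2.0, negative entries suggest web.
SCORE_TABLE = {
    "url_wap_marker": 3.0,
    "doctype_xhtml_mobile": 4.0,
    "all_nodes_narrow": 2.0,
    "has_wide_node": -3.0,
    "discouraged_tag": -2.0,
}


def feature_score(features, table=SCORE_TABLE):
    """Sum the scores of all extracted parameters; unknown parameters count as 0."""
    return sum(table.get(name, 0.0) for name in features)


mixed_score = feature_score({"url_wap_marker", "doctype_xhtml_mobile", "discouraged_tag"})
```

Summing is one simple way to combine the parameters; the text does not prescribe the combining function, so a weighted or normalized variant would fit the description equally well.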
205: Compare the wap2.0 webpage characteristic score with the preset webpage characteristic score threshold. If the score is greater than the threshold, go to 207; otherwise, go to 208.
206: Determine that the type of the webpage is wap1.0, then end.
207: Determine that the type of the webpage is wap2.0, then end.
208: Determine that the type of the webpage is web, then end.
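The branching among steps 202 to 208 can be summarized in a few lines. In this sketch the function name `classify_page` and its parameters are hypothetical: the caller is assumed to have already run the wap1.0 check and the wap2.0 extraction, and `score_of` stands in for whatever scoring function implements step 204.

```python
def classify_page(has_wap1_features, wap2_features, score_of, threshold):
    """Follow the flow of steps 202 to 208 and return "wap1.0", "wap2.0" or "web"."""
    if has_wap1_features:            # 202 -> 206
        return "wap1.0"
    if not wap2_features:            # 203 -> 208: nothing could be extracted
        return "web"
    score = score_of(wap2_features)  # 204
    # 205: strictly greater than the threshold -> 207, otherwise -> 208
    return "wap2.0" if score > threshold else "web"


# Toy scoring function for illustration: one point per extracted parameter.
result = classify_page(False, {"url_wap_marker", "doctype_xhtml_mobile"}, len, 1)
```

Note that a score exactly equal to the threshold falls on the web side, as step 205 sends only scores greater than the threshold to 207.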
It should be noted that determining whether the type of the webpage is wap2.0 or web is not limited to calculating the wap2.0 webpage characteristic score; the following approach can also be used:
Input the extracted wap2.0 webpage characteristic parameters into a webpage classification model, wherein the webpage classification model is obtained by machine learning on wap2.0 webpage characteristic parameters extracted from a preset plurality of webpages.
Determine through the webpage classification model whether the type of the webpage is wap2.0 or web.
The machine learning method may be a support vector machine (SVM), naive Bayes or the like, which is not specifically limited here.
In the method for determining the type of a webpage described in this embodiment, webpage characteristic parameters are extracted from the obtained document information of the webpage (comprising webpage address URL information, web document content information and webpage visual information). Because the extraction range is wide and multiple webpage characteristic parameters can be extracted, the type of the webpage can be determined effectively from the extracted parameters. Further, calculating the webpage characteristic score of the webpage and determining the type from that score can improve the accuracy of the determination. Further, determining the type through a webpage classification model, which is obtained in advance by machine learning on webpage characteristic parameters extracted from a preset plurality of webpages, can also improve the accuracy of the determination.
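Embodiment 3
Referring to Fig. 3, this embodiment of the present invention provides a device for determining the type of a webpage. The device comprises an acquisition module, an extraction module 302 and a determination module 303, which respectively obtain the document information of the webpage, extract webpage characteristic parameters from it, and determine the type of the webpage according to the extracted parameters.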
Preferably, the determination module 303 comprises:
A computing unit, configured to calculate the webpage characteristic score of the webpage according to the preset correspondence between webpage characteristic parameters and scores, after the extraction module 302 extracts the webpage characteristic parameters from the document information of the webpage;
A determining unit, configured to determine the type of the webpage according to the webpage characteristic score of the webpage, after the computing unit calculates the score.
Preferably, the determining unit comprises:
A comparison subunit, configured to compare the webpage characteristic score of the webpage with the preset webpage characteristic score threshold, after the computing unit calculates the score;
A first determining subunit, configured to determine that the type of the webpage is wap2.0 (Wireless Application Protocol 2.0) when the comparison result of the comparison subunit is that the webpage characteristic score of the webpage is greater than the preset webpage characteristic score threshold.
Preferably, the determining unit further comprises:
A second determining subunit, configured to determine that the type of the webpage is web when the comparison result of the comparison subunit is that the webpage characteristic score of the webpage is less than or equal to the preset webpage characteristic score threshold.
Preferably, the determination module 303 comprises:
A processing unit, configured to input the extracted webpage characteristic parameters into a webpage classification model after the extraction module 302 extracts them from the document information of the webpage, wherein the webpage classification model is obtained by machine learning on webpage characteristic parameters extracted from a preset plurality of webpages;
A classification model determining unit, configured to determine the type of the webpage through the webpage classification model, after the processing unit inputs the extracted webpage characteristic parameters into the model.
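The module structure of the device can be pictured as three pluggable stages chained in order. The sketch below is a rough illustration under stated assumptions: the class name `PageTypeDevice`, the callable signatures and the stub modules are hypothetical, and the stubs perform no network access or real feature extraction.

```python
from typing import Callable, Dict, Set


class PageTypeDevice:
    """Chains the three modules: acquisition, extraction and determination."""

    def __init__(
        self,
        acquire: Callable[[str], Dict[str, str]],
        extract: Callable[[Dict[str, str]], Set[str]],
        determine: Callable[[Set[str]], str],
    ) -> None:
        self.acquire = acquire
        self.extract = extract
        self.determine = determine

    def run(self, url: str) -> str:
        document = self.acquire(url)        # acquisition module
        features = self.extract(document)   # extraction module 302
        return self.determine(features)     # determination module 303


# Stub modules for illustration only.
device = PageTypeDevice(
    acquire=lambda url: {"url": url, "content": "", "visual": ""},
    extract=lambda doc: {"url_wap_marker"} if "//wap." in doc["url"] else set(),
    determine=lambda feats: "wap2.0" if feats else "web",
)
```

Passing the determination stage in as a callable lets the same device use either the score-threshold variant or the classification-model variant described above.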
Preferably, the web page characteristics parameter in web page address URL information comprises:
At least one in //wap. ,/wap/, //wap, // 3g. .3g. ,/3g/, //m. ,/m/ and .mobi/.
Preferably, the web page characteristics parameter in web document content information comprises:
At least one of the following: the wording "xhtml mobile" in the doctype declaration; a meta tag containing MobileOptimized; an <?xml declaration; a meta tag containing viewport, where the meta tag contains width=device-width or width=xxx; a time announcement; the wording "simple version", "color version", "touch-screen version" or "dazzling-color version"; an externally linked cascading style sheet (CSS) file whose filename contains wap, phone, mob or 3g; the wording "mobile phone version", "3g version" or "wap version" in the title; a character encoding of ASCII, GB2312, GBK or BIG5; the number of externally linked css files being greater than a preset css file count threshold; the number of externally linked js files being greater than a preset js file count threshold; the ratio of the number of table, tbody (table body), tr (table row) and td (table data) tags to the number of all tags in the web document content being greater than a preset tag ratio threshold; an html tag whose width is greater than a preset width threshold; the number of img tags being greater than a preset img tag count threshold; JavaScript code; the wording "dtd html" in the doctype declaration; RSS subscription information; an event related to mouse operation; an iframe tag (creates an inline floating frame), a button tag (defines a button container), a center tag (centers text and images), a frame tag, a frameset tag, or an applet tag (places executable content on the page); the title length being greater than a preset length threshold; the URL not beginning with www while the number of links in the web document content that begin with www is greater than a preset link count threshold; google ad code in which the width of the google ad is greater than a preset width threshold; add-to-favorites code; set-as-homepage code; a meta tag containing MSThemeCompatible; a meta tag containing x-ua-compatible; a width property value in embedded or externally linked css being greater than 320 pixels; and the size of the web document content being greater than a preset byte count threshold.
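A few of the content checks listed above can be sketched with Python's standard html.parser module. This is only an illustrative sketch, not the patented implementation: the threshold values, the feature labels and the function name extract_content_features are assumptions introduced here, since the source only speaks of "preset" thresholds.

```python
import re
from html.parser import HTMLParser

# Hypothetical thresholds; the source only calls them "preset".
CSS_FILE_THRESHOLD = 3
TABLE_TAG_RATIO_THRESHOLD = 0.3

MOBILE_CSS_NAME = re.compile(r"wap|phone|mob|3g", re.IGNORECASE)
TABLE_TAGS = {"table", "tbody", "tr", "td"}


class _TagCollector(HTMLParser):
    """Collects the doctype, meta tags, stylesheet links and tag counts."""

    def __init__(self):
        super().__init__()
        self.doctype = ""
        self.meta = {}          # lower-cased meta name -> lower-cased content
        self.css_hrefs = []
        self.tag_count = 0
        self.table_tag_count = 0

    def handle_decl(self, decl):
        if decl.lower().startswith("doctype"):
            self.doctype = decl.lower()

    def handle_starttag(self, tag, attrs):
        attrs = {k: (v or "") for k, v in attrs}
        self.tag_count += 1
        if tag in TABLE_TAGS:
            self.table_tag_count += 1
        if tag == "meta":
            name = (attrs.get("name") or attrs.get("http-equiv") or "").lower()
            if name:
                self.meta[name] = attrs.get("content", "").lower()
        if tag == "link" and "stylesheet" in attrs.get("rel", "").lower():
            self.css_hrefs.append(attrs.get("href", ""))


def extract_content_features(html):
    """Return the set of (hypothetical) feature labels found in the HTML."""
    c = _TagCollector()
    c.feed(html)
    c.close()
    features = set()
    if "xhtml mobile" in c.doctype:
        features.add("doctype_xhtml_mobile")
    if "mobileoptimized" in c.meta:
        features.add("meta_mobileoptimized")
    if "width=" in c.meta.get("viewport", ""):
        features.add("meta_viewport_width")
    # Only the filename part of each stylesheet URL is inspected.
    if any(MOBILE_CSS_NAME.search(h.rsplit("/", 1)[-1]) for h in c.css_hrefs):
        features.add("mobile_css_filename")
    if len(c.css_hrefs) > CSS_FILE_THRESHOLD:
        features.add("many_css_files")
    if c.tag_count and c.table_tag_count / c.tag_count > TABLE_TAG_RATIO_THRESHOLD:
        features.add("table_heavy_layout")
    return features
```

The remaining checks (title wording, character encoding, js file count and so on) would follow the same pattern of one test per feature.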
Preferably, the web page characteristic parameters in the webpage visual information comprise:
At least one of the following: the width of every tag node in the webpage being less than a first preset width threshold; a tag node whose width is greater than a second preset width threshold; a Lishu font or an italic font; the number of tag nodes whose float property value is right being greater than a preset tag node count threshold; a picture that is not a small icon being located in a preset restricted area, with the ratio of the width of that picture to the width of the whole webpage being less than a preset width ratio threshold; and a picture whose width is greater than a third preset width threshold.
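Several of the visual checks above can be illustrated on a list of rendered layout nodes. The sketch below assumes that a renderer has already produced the width and float value of each node; the LayoutNode type, the threshold values and the feature labels are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass

# Hypothetical thresholds in pixels (counts for the float check).
FIRST_WIDTH_THRESHOLD = 480
SECOND_WIDTH_THRESHOLD = 960
THIRD_WIDTH_THRESHOLD = 600
FLOAT_RIGHT_COUNT_THRESHOLD = 2


@dataclass
class LayoutNode:
    """A rendered tag node; field names are illustrative."""
    tag: str
    width: int
    css_float: str = "none"


def extract_visual_features(nodes):
    """Return the set of (hypothetical) visual feature labels for a page."""
    features = set()
    # Every tag node is narrower than the first threshold.
    if nodes and all(n.width < FIRST_WIDTH_THRESHOLD for n in nodes):
        features.add("all_nodes_narrow")
    # At least one tag node is wider than the second threshold.
    if any(n.width > SECOND_WIDTH_THRESHOLD for n in nodes):
        features.add("wide_node")
    # More than the threshold number of nodes float to the right.
    if sum(n.css_float == "right" for n in nodes) > FLOAT_RIGHT_COUNT_THRESHOLD:
        features.add("many_float_right")
    # At least one picture is wider than the third threshold.
    if any(n.tag == "img" and n.width > THIRD_WIDTH_THRESHOLD for n in nodes):
        features.add("wide_picture")
    return features
```

The font check and the restricted-area check for pictures are omitted here because they need font and position data that this simplified node type does not carry.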
The device for determining the type of a webpage described in this embodiment of the present invention extracts web page characteristic parameters from the obtained document information of the webpage (comprising web page address URL information, web document content information and webpage visual information). Because the extraction range is wide, multiple web page characteristic parameters can be extracted, so that the type of the webpage can be determined effectively from the extracted parameters. Further, calculating a web page characteristic score for the webpage and determining the type of the webpage from that score can improve the accuracy of the determination. Further, the type of the webpage may be determined by a webpage classification model; because the classification model is obtained in advance by machine-learning training on web page characteristic parameters extracted from multiple preset webpages, this can also improve the accuracy of the determination.
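The score-based determination can be sketched as a lookup table plus a threshold comparison. The feature labels, the score values and the threshold below are assumptions made for illustration; the source states only that a preset correspondence between characteristic parameters and scores exists, that a score above the preset threshold yields wap2.0, and that a score at or below it yields web.

```python
# Hypothetical correspondence between feature labels and scores.
# Positive scores are assumed here for mobile-style features and
# negative scores for desktop-style features.
FEATURE_SCORES = {
    "url_wap_marker": 3.0,
    "doctype_xhtml_mobile": 3.0,
    "meta_viewport_width": 2.0,
    "many_css_files": -1.0,
    "table_heavy_layout": -1.5,
    "wide_node": -2.0,
}

# Hypothetical preset score threshold.
SCORE_THRESHOLD = 2.0


def feature_score(features):
    """Sum the scores of the extracted features; unknown labels count as 0."""
    return sum(FEATURE_SCORES.get(f, 0.0) for f in features)


def determine_type(features):
    """Return "wap2.0" if the score exceeds the threshold, otherwise "web"."""
    return "wap2.0" if feature_score(features) > SCORE_THRESHOLD else "web"
```

Note that a score exactly equal to the threshold is classified as web, which matches the "less than or equal to" wording of the comparison.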
It should be noted that, when the device for determining the type of a webpage provided by the above embodiment triggers the intelligent network service, the division into the above functional modules is used only as an example. In practical applications, the above functions may be allocated to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the device for determining the type of a webpage provided by the above embodiment and the method embodiment for determining the type of a webpage belong to the same concept; for the specific implementation process, refer to the method embodiment, which is not repeated here.
The sequence numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
A person of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory (ROM), a magnetic disk, an optical disc or the like.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (12)
1. A method for determining the type of a webpage, characterized in that the method comprises:
obtaining document information of the webpage, wherein the document information of the webpage comprises web page address URL information, web document content information and webpage visual information;
extracting web page characteristic parameters from the document information of the webpage; and
determining the type of the webpage according to the extracted web page characteristic parameters.
2. The method according to claim 1, characterized in that determining the type of the webpage according to the extracted web page characteristic parameters comprises:
calculating a web page characteristic score of the webpage according to a preset correspondence between web page characteristic parameters and web page characteristic scores; and
determining the type of the webpage according to the web page characteristic score of the webpage.
3. The method according to claim 2, characterized in that determining the type of the webpage according to the web page characteristic score of the webpage comprises:
comparing the web page characteristic score of the webpage with a preset web page characteristic score threshold; and
if the web page characteristic score of the webpage is greater than the preset web page characteristic score threshold, determining that the type of the webpage is Wireless Application Protocol (WAP) 2.0, that is, wap2.0.
4. The method according to claim 3, characterized in that, after comparing the web page characteristic score of the webpage with the preset web page characteristic score threshold, the method further comprises:
if the web page characteristic score of the webpage is less than or equal to the preset web page characteristic score threshold, determining that the type of the webpage is web.
5. The method according to claim 1, characterized in that determining the type of the webpage according to the extracted web page characteristic parameters comprises:
inputting the extracted web page characteristic parameters into a webpage classification model, wherein the webpage classification model is obtained by performing machine learning on web page characteristic parameters extracted from multiple preset webpages; and
determining the type of the webpage by means of the webpage classification model.
6. The method according to any one of claims 1 to 5, characterized in that the web page characteristic parameters in the web page address URL information comprise:
at least one of //wap., /wap/, //wap, //3g., .3g., /3g/, //m., /m/ and .mobi/.
7. A device for determining the type of a webpage, characterized in that the device comprises:
an acquisition module, configured to obtain document information of the webpage, wherein the document information of the webpage comprises web page address URL information, web document content information and webpage visual information;
an extraction module, configured to extract web page characteristic parameters from the document information of the webpage after the acquisition module obtains the document information of the webpage; and
a determination module, configured to determine the type of the webpage according to the extracted web page characteristic parameters after the extraction module extracts the web page characteristic parameters from the document information of the webpage.
8. The device according to claim 7, characterized in that the determination module comprises:
a computing unit, configured to calculate a web page characteristic score of the webpage according to a preset correspondence between web page characteristic parameters and web page characteristic scores after the extraction module extracts the web page characteristic parameters from the document information of the webpage; and
a determining unit, configured to determine the type of the webpage according to the web page characteristic score of the webpage after the computing unit calculates the web page characteristic score of the webpage.
9. The device according to claim 8, characterized in that the determining unit comprises:
a comparison subunit, configured to compare the web page characteristic score of the webpage with a preset web page characteristic score threshold after the computing unit calculates the web page characteristic score of the webpage; and
a first determination subunit, configured to determine that the type of the webpage is Wireless Application Protocol (WAP) 2.0, that is, wap2.0, when the comparison result of the comparison subunit is that the web page characteristic score of the webpage is greater than the preset web page characteristic score threshold.
10. The device according to claim 9, characterized in that the determining unit further comprises:
a second determination subunit, configured to determine that the type of the webpage is web when the comparison result of the comparison subunit is that the web page characteristic score of the webpage is less than or equal to the preset web page characteristic score threshold.
11. The device according to claim 7, characterized in that the determination module comprises:
a processing unit, configured to input the extracted web page characteristic parameters into a webpage classification model after the extraction module extracts the web page characteristic parameters from the document information of the webpage, wherein the webpage classification model is obtained by performing machine learning on web page characteristic parameters extracted from multiple preset webpages; and
a classification model determining unit, configured to determine the type of the webpage by means of the webpage classification model after the processing unit inputs the extracted web page characteristic parameters into the webpage classification model.
12. The device according to any one of claims 7 to 11, characterized in that the web page characteristic parameters in the web page address URL information comprise:
at least one of //wap., /wap/, //wap, //3g., .3g., /3g/, //m., /m/ and .mobi/.
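The URL markers named in the claims lend themselves to a plain substring check. The following is a minimal sketch under that assumption; the function name url_markers and the case-insensitive matching are choices made here for illustration and are not stated in the source.

```python
# URL markers as listed in the claims.
URL_MARKERS = [
    "//wap.", "/wap/", "//wap", "//3g.", ".3g.", "/3g/", "//m.", "/m/", ".mobi/",
]


def url_markers(url):
    """Return the markers that occur in the URL, in list order.

    Matching is a case-insensitive substring test (an assumption).
    """
    lowered = url.lower()
    return [marker for marker in URL_MARKERS if marker in lowered]


if __name__ == "__main__":
    for example in ("http://wap.example.com/news",
                    "http://www.example.com/m/index.html",
                    "http://www.example.com/"):
        print(example, url_markers(example))
```

Because //wap is a prefix of //wap., a URL that matches the longer marker also matches the shorter one; a caller that counts markers may want to treat the pair as a single hit.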
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210539055.0A CN103870486A (en) | 2012-12-13 | 2012-12-13 | Webpage type confirming method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210539055.0A CN103870486A (en) | 2012-12-13 | 2012-12-13 | Webpage type confirming method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103870486A true CN103870486A (en) | 2014-06-18 |
Family
ID=50909029
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210539055.0A Pending CN103870486A (en) | 2012-12-13 | 2012-12-13 | Webpage type confirming method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103870486A (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103383695A (en) * | 2013-06-24 | 2013-11-06 | 百度在线网络技术(北京)有限公司 | Method and equipment for identifying WAP web page |
CN104090931A (en) * | 2014-06-25 | 2014-10-08 | 华南理工大学 | Information prediction and acquisition method based on webpage link parameter analysis |
CN104392009A (en) * | 2014-12-19 | 2015-03-04 | 北京奇虎科技有限公司 | Method and device for acquiring mobile site link address |
CN105138698A (en) * | 2015-09-25 | 2015-12-09 | 百度在线网络技术(北京)有限公司 | Dynamic layout method and device for webpages |
CN106294881A (en) * | 2016-08-30 | 2017-01-04 | 五八同城信息技术有限公司 | information identifying method and device |
CN107741942A (en) * | 2016-12-09 | 2018-02-27 | 腾讯科技(深圳)有限公司 | A kind of webpage content extracting method and device |
CN108108366A (en) * | 2016-11-24 | 2018-06-01 | 腾讯科技(深圳)有限公司 | A kind of webpage classification recognition methods and device |
CN108256104A (en) * | 2018-02-05 | 2018-07-06 | 恒安嘉新(北京)科技股份公司 | Internet site compressive classification method based on multidimensional characteristic |
CN110287409A (en) * | 2019-06-05 | 2019-09-27 | 新华三信息安全技术有限公司 | A kind of webpage type identification method and device |
CN111639250A (en) * | 2020-06-05 | 2020-09-08 | 深圳市小满科技有限公司 | Enterprise description information acquisition method and device, electronic equipment and storage medium |
CN112084410A (en) * | 2020-09-10 | 2020-12-15 | 北京百度网讯科技有限公司 | Document type recommendation method and device, electronic equipment and readable storage medium |
US11074306B2 (en) | 2016-12-09 | 2021-07-27 | Tencent Technology (Shenzhen) Company Limited | Web content extraction method, device, storage medium |
2012-12-13: CN application CN201210539055.0A, published as CN103870486A (status: active, Pending)
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103383695A (en) * | 2013-06-24 | 2013-11-06 | 百度在线网络技术(北京)有限公司 | Method and equipment for identifying WAP web page |
CN104090931A (en) * | 2014-06-25 | 2014-10-08 | 华南理工大学 | Information prediction and acquisition method based on webpage link parameter analysis |
WO2015196740A1 (en) * | 2014-06-25 | 2015-12-30 | 华南理工大学 | Information forecast and acquisition method based on webpage link parameter analysis |
CN104392009A (en) * | 2014-12-19 | 2015-03-04 | 北京奇虎科技有限公司 | Method and device for acquiring mobile site link address |
CN105138698A (en) * | 2015-09-25 | 2015-12-09 | 百度在线网络技术(北京)有限公司 | Dynamic layout method and device for webpages |
CN106294881A (en) * | 2016-08-30 | 2017-01-04 | 五八同城信息技术有限公司 | information identifying method and device |
CN108108366A (en) * | 2016-11-24 | 2018-06-01 | 腾讯科技(深圳)有限公司 | A kind of webpage classification recognition methods and device |
CN107741942B (en) * | 2016-12-09 | 2020-06-02 | 腾讯科技(深圳)有限公司 | Webpage content extraction method and device |
CN107741942A (en) * | 2016-12-09 | 2018-02-27 | 腾讯科技(深圳)有限公司 | A kind of webpage content extracting method and device |
US11074306B2 (en) | 2016-12-09 | 2021-07-27 | Tencent Technology (Shenzhen) Company Limited | Web content extraction method, device, storage medium |
CN108256104A (en) * | 2018-02-05 | 2018-07-06 | 恒安嘉新(北京)科技股份公司 | Internet site compressive classification method based on multidimensional characteristic |
CN108256104B (en) * | 2018-02-05 | 2020-05-26 | 恒安嘉新(北京)科技股份公司 | Comprehensive classification method of internet websites based on multidimensional characteristics |
CN110287409A (en) * | 2019-06-05 | 2019-09-27 | 新华三信息安全技术有限公司 | A kind of webpage type identification method and device |
CN110287409B (en) * | 2019-06-05 | 2022-07-22 | 新华三信息安全技术有限公司 | Webpage type identification method and device |
CN111639250A (en) * | 2020-06-05 | 2020-09-08 | 深圳市小满科技有限公司 | Enterprise description information acquisition method and device, electronic equipment and storage medium |
CN111639250B (en) * | 2020-06-05 | 2023-05-16 | 深圳市小满科技有限公司 | Enterprise description information acquisition method and device, electronic equipment and storage medium |
CN112084410A (en) * | 2020-09-10 | 2020-12-15 | 北京百度网讯科技有限公司 | Document type recommendation method and device, electronic equipment and readable storage medium |
CN112084410B (en) * | 2020-09-10 | 2023-07-25 | 北京百度网讯科技有限公司 | Document type recommendation method and device, electronic equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103870486A (en) | Webpage type confirming method and device | |
CN102200971B (en) | Method and equipment for realizing webpage content previewing | |
US7853871B2 (en) | System and method for identifying segments in a web resource | |
Asakawa et al. | Transcoding | |
US8196036B2 (en) | Method and system for converting hypertext markup language web page to plain text | |
CN104461484B (en) | The implementation method and device of front-end template | |
CN103166981B (en) | A kind of radio web page code-transferring method and device | |
CN106371844A (en) | Method and system for presenting webpage by native user interface assembly | |
CN100440127C (en) | Method and apparatus for printing web page | |
CN105677764A (en) | Information extraction method and device | |
CN107256234A (en) | A kind of web page text method of adjustment and its equipment | |
JP2016522481A (en) | Client-side page processing | |
CN102436454A (en) | Input method switching method and system for browser | |
US20210042466A1 (en) | Detecting compatible layouts for content-based native ads | |
CN101621862A (en) | Method and device for positioning effective information rapidly for mobile phone browser | |
CN103207874A (en) | Updated webpage content prompting method and system | |
CN105760542A (en) | Display control method, terminal and server | |
CN104090869B (en) | A kind of method and translation system for translating the network information | |
CN103365877B (en) | Method and server to establishing catalogue after webpage progress transcoding | |
CN102314494A (en) | Method and equipment for processing webpage contents | |
CN103136259A (en) | Method and device for processing webpage contents based on content block identification | |
CN112800372B (en) | Page loading method and device and electronic equipment | |
CN105938496A (en) | Webpage content extraction method and apparatus | |
CN103617043A (en) | Method and system with picture webpage data uploading function | |
CN103365920A (en) | Method for displaying webpage, browser and mobile terminal |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20140618 |