CN104123363A

CN104123363A - Method and device for extracting main image of webpage

Info

Publication number: CN104123363A
Application number: CN201410346226.7A
Authority: CN
Inventors: 陈华清; 许晟
Original assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2014-07-21
Filing date: 2014-07-21
Publication date: 2014-10-29
Anticipated expiration: 2034-07-21
Also published as: CN104123363B

Abstract

The invention discloses a method and device for extracting a main image of a webpage. The method comprises the steps that HTML text of the webpage is obtained, simulation typesetting display is conducted on the HTML text, and visual information of each HTML element in the webpage is obtained; the HTML text is segmented with block information as a unit; text information in the block information is obtained, and image information is obtained from the block information according to the visual information; images meeting preset visual requirements are obtained according to the image information, an image meeting screening rules is further selected from the images meeting the preset visual requirements according to the text information and the image information, and the image is taken as the main image of the webpage. By means of the technical scheme, quite high accuracy and efficiency can be achieved on the selection of the main image.

Description

Method and device for extracting the main image of a webpage

Technical field

The present invention relates to the field of computer technology, and in particular to a method and device for extracting the main image of a webpage.

Background technology

With the development of Internet technology, webpages written in Hypertext Markup Language (HTML) take increasingly diverse forms, and one such trend is the large number of images appearing in webpages. Compared with traditional text, images have unique advantages in attracting attention and conveying meaning. For this reason, many search engines now provide, in addition to a title and a summary, a main image extracted from the webpage in their search results.

As shown in Figure 1, in the prior art, search engine results include more and more images, which helps users identify the information they are looking for and improves the click-through rate. Likewise, in Internet advertising, display advertisements have a clear advantage over plain text-link advertisements because they let users see product information at a glance. Extracting the main image from a webpage is therefore important for improving the search experience and the click-through rate, and a method for extracting the main image of a webpage is needed.

Summary of the invention

In view of the above problems, the present invention is proposed to provide a method and device for extracting the main image of a webpage that overcome the above problems or at least partially solve them.

The invention provides a method for extracting the main image of a webpage, comprising: obtaining the HTML text of the webpage, performing simulated layout rendering on the HTML text, and obtaining the visual information of each HTML element in the webpage; segmenting the HTML text in units of block information; obtaining the text information in the block information, and obtaining image information from the block information according to the visual information; and obtaining, according to the image information, the images that meet predetermined visual requirements, further selecting from those images, according to the text information and the image information, an image that meets the screening rules, and taking that image as the main image of the webpage.

Preferably, obtaining the HTML text of the webpage specifically comprises: obtaining the HTML text of the webpage according to the uniform resource locator (URL) of the webpage.

Preferably, the visual information comprises: the position information and size information of each HTML element of the webpage in the simulated layout rendering.

Preferably, the text information comprises: non-hyperlink text length, hyperlink text length, number of hyperlinks, a hyperlink array, and an image array.

Preferably, the image information comprises: the URL of the image link, the descriptive text of the image, the length of the image, the width of the image, the ordinate of the image in the simulated layout rendering, and the abscissa of the image in the simulated layout rendering.

Preferably, obtaining the image information specifically comprises: extracting the URL of the image link and the descriptive text of the image from the block information; calculating the length and width of the image according to a preset algorithm priority; and obtaining, according to the visual information, the ordinate and abscissa of the image in the simulated layout rendering.

Preferably, calculating the length and width of the image according to the preset algorithm priority specifically comprises at least one of the following: the highest-priority algorithm obtains the length and width of the image from the HTML attributes in the image tag; the second-priority algorithm fetches the image and obtains its length and width with image processing software; the third-priority algorithm obtains the length and width of the image from the Document Object Model (DOM) information in the browser rendering engine.

Preferably, the predetermined visual requirements comprise: the position of the image lies within a predetermined region, and the length, width, and aspect ratio of the image meet predetermined requirements.

Preferably, the screening rules specifically comprise at least one of the following: taking the image located between the webpage navigation bar or menu and the long text as the main image; selecting the first image in a group of images of identical size as the main image; for a webpage of the search-results type, selecting the first image as the main image; taking the largest image in the display area as the main image; calculating the relevance between the descriptive text of each image and the topic of the webpage, and taking the image with the highest relevance as the main image; and, when the webpage is a website home page or topic page, selecting the website logo as the main image.

The present invention also provides a device for extracting the main image of a webpage, comprising: a webpage fetching module, configured to obtain the HTML text of the webpage, perform simulated layout rendering on the HTML text, and obtain the visual information of each HTML element in the webpage; an HTML parsing module, configured to segment the HTML text in units of block information; an information acquisition module, configured to obtain the text information in the block information and to obtain image information from the block information according to the visual information; and a screening module, configured to obtain, according to the image information, the images that meet predetermined visual requirements, to further select from those images, according to the text information and the image information, an image that meets the screening rules, and to take that image as the main image of the webpage.

Preferably, the webpage fetching module is specifically configured to obtain the HTML text of the webpage according to the uniform resource locator (URL) of the webpage.

Preferably, the information acquisition module is specifically configured to: extract the URL of the image link and the descriptive text of the image from the block information; calculate the length and width of the image according to a preset algorithm priority; and obtain, according to the visual information, the ordinate and abscissa of the image in the simulated layout rendering.

Preferably, the highest-priority algorithm obtains the length and width of the image from the HTML attributes in the image tag; the second-priority algorithm fetches the image and obtains its length and width with image processing software; the third-priority algorithm obtains the length and width of the image from the Document Object Model (DOM) information in the browser rendering engine.

The beneficial effects of the present invention are as follows:

Candidate main images are first shortlisted according to the image information, and the final main image is then selected from the candidate set according to the screening rules, so that main-image selection can achieve high accuracy. In addition, because the technical solution of the embodiments of the present invention uses a visual region for positioning, the number of candidate images that must be evaluated is greatly reduced, which greatly increases the speed of main-image extraction.

The above is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features, and advantages of the present invention become more apparent, specific embodiments of the present invention are set out below.

Brief description of the drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art from the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the present invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:

Fig. 1 is a schematic diagram of a prior-art search engine results page displaying webpage main images;

Fig. 2 is a flowchart of the method for extracting the main image of a webpage according to an embodiment of the present invention;

Fig. 3 is a processing schematic diagram of the method for extracting the main image of a webpage according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of main-image screening example 1 of an embodiment of the present invention;

Fig. 5 is a schematic diagram of main-image screening example 2 of an embodiment of the present invention;

Fig. 6 is a schematic diagram of main-image screening example 3 of an embodiment of the present invention;

Fig. 7 is a schematic diagram of main-image screening example 4 of an embodiment of the present invention;

Fig. 8 is a schematic diagram of main-image screening example 5 of an embodiment of the present invention;

Fig. 9 is a schematic diagram of main-image screening example 6 of an embodiment of the present invention;

Fig. 10 is a structural schematic diagram of the device for extracting the main image of a webpage according to an embodiment of the present invention.

Embodiment

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.

Existing methods for extracting the main image of a webpage include the following two approaches:

Approach one: statistics based on user behavior. This approach rests on the assumption that the more often users click an image in a webpage, the more important it is. The technical solution is as follows: first, count the user clicks on every image on each webpage; then select the image with the most clicks as the main image of the webpage. This solution has the following problems: 1. Low recall: not every image receives clicks, and some images are not links at all. 2. Poor timeliness: for a newly published webpage there is no user behavior data, so no image can be extracted. 3. Low confidence: when an image has few clicks, the result is easily biased, and many small companies cannot obtain user behavior data as rich as that of large companies. 4. User behavior bias: for example, if some images in a webpage show attractive women, they draw more attention and therefore more clicks.

Approach two: classification based on machine learning. The technical solution is as follows: step 1, extract features of the images in the webpage, for example image size, position in the HTML, and the descriptive information of the image; step 2, prepare a labeled set by choosing a number of webpages and labeling whether each image in them is the main image; step 3, train a classification model (for example, logistic regression, SVM, decision forest, or GBDT); step 4, use the trained model to predict whether each image in a webpage is the main image. This solution has the following problems: 1. Labeling requires a great deal of manual work, since it must cover different types of webpages and each webpage contains many images. 2. A large number of features must be selected, and bad cases cannot be fixed immediately. 3. Every image must be evaluated, so the computational cost is high.

To solve the above problems in the prior art, the present invention provides a method and device for extracting the main image of a webpage that support both online and offline extraction. In online mode, only the webpage URL needs to be passed in: the HTML text is fetched and laid out by a browser rendering engine, the HTML text is parsed and organized into the data structures required by subsequent processing, and finally the visual information and the screening rules are analyzed to obtain the main image of the webpage. The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it.

Method embodiment

According to an embodiment of the present invention, a method for extracting the main image of a webpage is provided. Fig. 2 is a flowchart of the method; as shown in Fig. 2, the method comprises the following processing:

S210: obtain the HTML text of the webpage, perform simulated layout rendering on the HTML text, and obtain the visual information of each HTML element in the webpage. In the embodiments of the present invention, the visual information comprises the position information and size information of each HTML element of the webpage in the simulated layout rendering.

The embodiments of the present invention support both online and offline extraction of the main image. In offline mode, the HTML text of the webpage must already be available; in online mode, the HTML text can be fetched online according to the URL of the webpage.
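The two acquisition modes can be sketched in Python as follows. This is a minimal illustration only: the function name `load_html` and its parameters are hypothetical, and the sketch covers only retrieval of the raw HTML text, not the simulated layout rendering that S210 also requires.

```python
from pathlib import Path
from typing import Optional
from urllib.request import urlopen


def load_html(url: Optional[str] = None, path: Optional[str] = None,
              timeout: float = 10.0) -> str:
    """Return HTML text from a local file (offline) or from a URL (online)."""
    if path is not None:
        # Offline mode: the HTML text has already been saved locally.
        return Path(path).read_text(encoding="utf-8", errors="replace")
    if url is not None:
        # Online mode: fetch the HTML text for the given URL.
        with urlopen(url, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    raise ValueError("either url or path must be provided")
```

A plain HTTP fetch like this does not execute JavaScript or compute layout, which is why the embodiment described later relies on a browser rendering engine for the visual information.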

S220: segment the HTML text in units of block information. Here, block information refers to an HTML fragment delimited by a tag such as <DIV> or <TABLE>.

S230: obtain the text information in the block information, and obtain image information from the block information according to the visual information. The text information may comprise: non-hyperlink text length, hyperlink text length, number of hyperlinks, a hyperlink array, and an image array. The image information comprises: the URL of the image link, the descriptive text of the image, the length of the image, the width of the image, the ordinate of the image in the simulated layout rendering, and the abscissa of the image in the simulated layout rendering.

In other words, the image information obtained from the block information according to the visual information in S230 can be regarded as a processed, more detailed form of visual information.

In S230, obtaining the image information specifically comprises:

Step 1: extract the URL of the image link and the descriptive text of the image from the block information;

Step 2: calculate the length and width of the image according to a preset algorithm priority. Specifically, this comprises at least one of the following: the highest-priority algorithm obtains the length and width of the image from the HTML attributes in the image tag; the second-priority algorithm fetches the image and obtains its length and width with image processing software; the third-priority algorithm obtains the length and width of the image from the Document Object Model (DOM) information in the browser rendering engine.

Step 3: obtain, according to the visual information, the ordinate and abscissa of the image in the simulated layout rendering.

S240: obtain, according to the image information, the images that meet predetermined visual requirements (for example, images whose length is between 60 and 760, whose width is between 60 and 760, and whose aspect ratio is between 0.5 and 2.5), further select from those images, according to the text information and the image information, an image that meets the screening rules, and take that image as the main image of the webpage.

In S240, the predetermined visual requirements comprise: the position of the image lies within a predetermined region, and the length, width, and aspect ratio of the image meet predetermined requirements.

The screening rules specifically comprise at least one of the following: taking the image located between the webpage navigation bar or menu and the long text as the main image; selecting the first image in a group of images of identical size as the main image; for a webpage of the search-results type, selecting the first image as the main image; taking the largest image in the display area as the main image; calculating the relevance between the descriptive text of each image and the topic of the webpage, and taking the image with the highest relevance as the main image; and, when the webpage is a website home page or topic page, selecting the website logo as the main image.

The above technical solution of the embodiments of the present invention is described in further detail below with reference to examples and the drawings.

Fig. 3 is a processing schematic diagram of the method for extracting the main image of a webpage according to an embodiment of the present invention. As shown in Fig. 3, in online mode only the webpage URL needs to be passed in. The webpage fetching module fetches the page and lays it out with a browser rendering engine; the HTML parsing module then parses the result and organizes it into the data structures required by the downstream module; finally, the visual information and rule base analysis module analyzes the data and obtains the main image of the webpage. Each processing stage involved in the method is described in detail below.

Webpage fetching module: unlike traditional fetching modules based on cURL, Wget, or the HTTP protocol alone, this module does not simply obtain the HTML text. It needs to obtain two kinds of information: first, the HTML text; second, the result of laying out the HTML text, simulating browser behavior with JavaScript support, so as to obtain the display position and size (that is, the visual information) of each HTML element in the browser.

In the embodiments of the present invention, the layout rendering of the webpage fetching module can be implemented with PhantomJS. PhantomJS is a browser rendering engine based on the WebKit kernel with complete JavaScript parsing and page rendering capabilities, and can be used to simulate the events that a modern browser performs when loading a webpage.

In addition, in the embodiments of the present invention, the visual information obtained by the webpage fetching module can be read by accessing the DOM structure of the HTML through JavaScript:

var actualLeft = images[i].offsetLeft;
var actualTop = images[i].offsetTop;
var current = images[i].offsetParent;
// Walk up the offsetParent chain, accumulating offsets to get page coordinates.
while (current !== null) {
    actualLeft += current.offsetLeft;
    actualTop += current.offsetTop;
    current = current.offsetParent;
}
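The same accumulation can be written outside the browser once the offsets have been exported. The following Python sketch mirrors the loop above; the `LayoutNode` class is a hypothetical stand-in for a DOM element and is not part of the original disclosure.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class LayoutNode:
    """Hypothetical stand-in for the offset properties of a DOM element."""
    offset_left: int
    offset_top: int
    offset_parent: Optional["LayoutNode"] = None


def absolute_position(node: LayoutNode) -> Tuple[int, int]:
    """Sum offsets along the offset-parent chain to get page coordinates."""
    left, top = node.offset_left, node.offset_top
    current = node.offset_parent
    while current is not None:
        left += current.offset_left
        top += current.offset_top
        current = current.offset_parent
    return left, top
```

For an image at offset (5, 7) inside a container at offset (10, 20) whose parent sits at the page origin, the function returns (15, 27).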

HTML parsing module: parses the HTML, using a finite state machine to segment the HTML text by block information. The main purpose is to organize the webpage into a structured form, which is the foundation of subsequent processing.

For example, for the following HTML fragment:

After parsing, it becomes the following data structure:

In the above example, the block information consists mainly of the text and hyperlinks in the block.
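As a separate, hypothetical illustration of block-level segmentation, the sketch below collects per-block text lengths, hyperlinks, and images. It uses Python's standard `html.parser.HTMLParser` in place of the hand-written finite state machine mentioned above, and the dictionary keys (`text_len`, `link_text_len`, `links`, `images`) are assumed names, not the data structure of the original example.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table"}  # tags treated as block boundaries


class BlockSplitter(HTMLParser):
    """Collects simple per-block statistics for <div> and <table> fragments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []      # finished blocks, in closing order
        self._stack = []      # currently open blocks, innermost last
        self._link_depth = 0  # greater than 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in BLOCK_TAGS:
            self._stack.append({"tag": tag, "text_len": 0, "link_text_len": 0,
                                "links": [], "images": []})
        elif tag == "a":
            self._link_depth += 1
            if self._stack and attrs.get("href"):
                self._stack[-1]["links"].append(attrs["href"])
        elif tag == "img" and self._stack:
            self._stack[-1]["images"].append(
                {"src": attrs.get("src"), "alt": attrs.get("alt") or ""})

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._stack:
            self.blocks.append(self._stack.pop())
        elif tag == "a" and self._link_depth:
            self._link_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._stack:
            return
        # Text inside a hyperlink is counted separately from plain text.
        key = "link_text_len" if self._link_depth else "text_len"
        self._stack[-1][key] += len(text)


def split_blocks(html):
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

The sketch attributes content to the innermost open block and ignores content outside any block; it also assumes reasonably well-nested markup, which real pages do not always provide.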

Visual information and rule base analysis module: first, the block information passed in by the HTML parsing module is processed, and the following two data structures (the text information and image information described above) are obtained from it:

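class TextBlock:
    def __init__(self):
        # Text information of a block: non-hyperlink text length, hyperlink
        # text length, number of hyperlinks, hyperlink array, image array.
        pass

class ImageBlock:
    # Image information: link URL, descriptive text, length, width, and the
    # abscissa and ordinate in the simulated layout rendering.
    pass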
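One way to flesh out these two structures is with dataclasses, as sketched below. The field names are hypothetical; only the kinds of information they hold (listed in S230) come from the description.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ImageBlock:
    url: str                      # URL of the image link
    alt_text: str = ""            # descriptive text of the image
    length: Optional[int] = None  # None until a size lookup succeeds
    width: Optional[int] = None
    x: Optional[int] = None       # abscissa in the simulated layout
    y: Optional[int] = None       # ordinate in the simulated layout


@dataclass
class TextBlock:
    text_len: int = 0             # non-hyperlink text length
    link_text_len: int = 0        # hyperlink text length
    link_count: int = 0           # number of hyperlinks
    links: List[str] = field(default_factory=list)
    images: List[ImageBlock] = field(default_factory=list)
```

Using `default_factory` gives every `TextBlock` its own lists, so images appended to one block do not leak into another.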

In the embodiments of the present invention, the length and width of an image are calculated by three methods in order of priority (that is, if one method fails to produce a result, the next is tried):

1. Obtain the length and width of the image from the HTML attributes in the image tag;

2. Fetch the image and obtain its length and width with ImageMagick;

3. Obtain the length and width of the image from the DOM information in PhantomJS.
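The priority order above amounts to a fallback chain. The Python sketch below shows the control flow under stated assumptions: the image is represented as a dictionary with an `attrs` entry, the function names are hypothetical, and the second and third methods are left as caller-supplied callables rather than real ImageMagick or PhantomJS calls.

```python
from typing import Callable, Optional, Sequence, Tuple

Size = Tuple[int, int]
SizeProvider = Callable[[dict], Optional[Size]]


def size_from_tag_attributes(img: dict) -> Optional[Size]:
    """Highest priority: read the size attributes written in the image tag."""
    attrs = img.get("attrs", {})
    try:
        first, second = int(attrs["width"]), int(attrs["height"])
    except (KeyError, TypeError, ValueError):
        return None  # attributes missing or not plain integers, e.g. "100%"
    return (first, second) if first > 0 and second > 0 else None


def first_available_size(img: dict,
                         providers: Sequence[SizeProvider]) -> Optional[Size]:
    """Try each provider in priority order and return the first result."""
    for provider in providers:
        size = provider(img)
        if size is not None:
            return size
    return None
```

A caller would pass the providers in priority order, for example `[size_from_tag_attributes, measure_downloaded_image, size_from_dom]`, where the last two names stand for whatever wrappers an implementation supplies around its image tool and rendering engine.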

The visual information and rule base analysis module also selects images based on vision. Specifically, as shown in Fig. 4, webpage authors generally place the main image in the visual foreground of the page, that is, where people see it most easily. The bottom of the page, positions that can only be seen after scrolling, and the corners of the page are seldom used for the main image.

Statistics on the visual regions containing the main images of a large number of webpages show that most main images lie within the rectangular region whose abscissa ranges from 0 to 700 and whose ordinate ranges from 20 to 1000 (unit: pixel). Using this visual information, all images of suitable size within this region can be put into the candidate set. For example, the image size satisfies: length between 60 and 760, width between 60 and 760, and aspect ratio between 0.5 and 2.5. Images that do not satisfy these conditions mostly fail the shape and size requirements of a main image and are generally icons, advertising banners, and the like.
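The candidate test described above can be sketched as a simple predicate. The numeric ranges come from the text; the rest is assumption: the `Image` class is hypothetical, bounds are treated as inclusive, the position check uses the image's reported coordinates only, and the aspect ratio is taken as length divided by width.

```python
from dataclasses import dataclass
from typing import Tuple

X_RANGE = (0, 700)        # abscissa range of the visual region, in pixels
Y_RANGE = (20, 1000)      # ordinate range of the visual region, in pixels
SIDE_RANGE = (60, 760)    # allowed length and width
RATIO_RANGE = (0.5, 2.5)  # allowed aspect ratio


@dataclass
class Image:
    url: str
    length: int
    width: int
    x: int
    y: int


def in_range(value: float, bounds: Tuple[float, float]) -> bool:
    low, high = bounds
    return low <= value <= high


def is_candidate(img: Image) -> bool:
    """Return True if the image passes the position, size, and ratio checks."""
    if not (in_range(img.x, X_RANGE) and in_range(img.y, Y_RANGE)):
        return False
    if not (in_range(img.length, SIDE_RANGE) and in_range(img.width, SIDE_RANGE)):
        return False
    # The size check above guarantees a non-zero width here.
    return in_range(img.length / img.width, RATIO_RANGE)
```

Because the position check runs first, images outside the visual region are rejected without any further size computation, which reflects the speed benefit the description attributes to visual-region positioning.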

Finally, the visual information and rule base analysis module further screens the images based on rules:

Rule one: as shown in Fig. 5, an image located between the webpage navigation bar (or menu) and the long text is generally the main image;

Rule two: as shown in Fig. 6, in a group of images of identical size, choose the first as the main image;

Rule three: as shown in Fig. 7, for a webpage of the search-results type, choose the first image as the main image;

Rule four: as shown in Fig. 8, choose the largest image in the display area as the main image;

Rule five: calculate the relevance between the descriptive information of each image and the webpage TITLE, and choose the image with the higher relevance as the main image. The main image is generally highly relevant to the webpage, and the TITLE of the webpage is a concentrated expression of its content; if the descriptive information of an image is highly relevant to the webpage TITLE, the image can be regarded as the main image.

Rule six: as shown in Fig. 9, when none of the above rules finds a qualifying main image and the webpage is a website home page or topic page, choose the website logo as the main image.
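Rule five does not specify how relevance is measured. As one possible reading, the sketch below scores each image by the Jaccard overlap between the word sets of its descriptive text and the page TITLE. The measure, the function names, and the whitespace-style tokenization are all assumptions; text in languages written without spaces, such as Chinese, would need a proper word segmenter.

```python
import re
from typing import Iterable, Optional, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased word tokens of a string."""
    return set(re.findall(r"\w+", text.lower()))


def relevance(description: str, title: str) -> float:
    """Jaccard overlap between the two token sets, from 0.0 to 1.0."""
    a, b = tokens(description), tokens(title)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_relevant(images: Iterable[Tuple[str, str]], title: str) -> Optional[str]:
    """images holds (url, description) pairs; returns the best URL or None."""
    best = max(images, key=lambda item: relevance(item[1], title), default=None)
    if best is None or relevance(best[1], title) == 0.0:
        return None  # nothing overlaps the title, so this rule does not decide
    return best[0]
```

Returning `None` when no description overlaps the title lets a caller fall through to another rule instead of picking an arbitrary image.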

It should be noted that one or more of the rules may be applied selectively, and they may be executed in any order. Preferably, in a specific embodiment, all six rules are applied and executed in the order given above.
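Applying rules in a fixed order can be sketched as a chain in which the first rule that returns an image decides the result. The sketch below implements only simplified readings of rules two and four; the dictionary keys, the function names, and the use of length times width as the measure of "largest" are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Sequence

Rule = Callable[[Sequence[dict], dict], Optional[dict]]


def first_of_same_size_group(candidates: Sequence[dict], page: dict) -> Optional[dict]:
    """Rule two (one reading): first image of the first same-size group."""
    groups: Dict[tuple, List[dict]] = {}
    for img in candidates:
        groups.setdefault((img["length"], img["width"]), []).append(img)
    for img in candidates:
        group = groups[(img["length"], img["width"])]
        if len(group) > 1:
            return group[0]
    return None


def largest_area(candidates: Sequence[dict], page: dict) -> Optional[dict]:
    """Rule four (one reading): image with the largest length x width."""
    return max(candidates, key=lambda i: i["length"] * i["width"], default=None)


def select_main_image(candidates: Sequence[dict], page: dict,
                      rules: Sequence[Rule]) -> Optional[dict]:
    """Run the rules in order; the first non-None answer wins."""
    for rule in rules:
        chosen = rule(candidates, page)
        if chosen is not None:
            return chosen
    return None
```

Keeping each rule as an independent callable matches the statement that rules may be chosen selectively and reordered: a different embodiment only needs a different `rules` list.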

In summary, with the technical solution of the embodiments of the present invention, main-image extraction is performed on a single page without relying on user behavior, so there is no cold-start problem and the method adapts well. In addition, by combining visual information with multiple rules to simulate how people recognize the main image, the method achieves high accuracy. Moreover, because a visual region is used for positioning, the number of candidate images that must be evaluated is greatly reduced, which greatly increases the extraction speed. The technical solution solves the prior-art problem of extracting the main image of a webpage, so that the main image can be shown on search results pages, forming a richer presentation together with the title and summary of the webpage. It also enriches the presentation of advertising creatives, replacing plain text-link display, and can improve the click-through rate of advertisements.

Device embodiment

According to an embodiment of the present invention, a device for extracting the main image of a webpage is provided. Fig. 10 is a structural schematic diagram of the device. As shown in Fig. 10, the device comprises: a webpage fetching module 100, an HTML parsing module 102, an information acquisition module 104, and a screening module 106. Each module is described in detail below.

The webpage fetching module 100 is configured to obtain the HTML text of the webpage, perform simulated layout rendering on the HTML text, and obtain the visual information of each HTML element in the webpage. In the embodiments of the present invention, the visual information comprises the position information and size information of each HTML element of the webpage in the simulated layout rendering.

The embodiments of the present invention support both online and offline extraction of the main image. In offline mode, the webpage fetching module 100 can obtain the HTML text of the webpage directly; in online mode, the webpage fetching module 100 can fetch the HTML text online according to the URL of the webpage.

The HTML parsing module 102 is configured to segment the HTML text in units of block information. Here, block information refers to an HTML fragment delimited by a tag such as <DIV> or <TABLE>.

The information acquisition module 104 is configured to obtain the text information in the block information and to obtain image information from the block information according to the visual information. The text information may comprise: non-hyperlink text length, hyperlink text length, number of hyperlinks, a hyperlink array, and an image array. The image information comprises: the URL of the image link, the descriptive text of the image, the length of the image, the width of the image, the ordinate of the image in the simulated layout rendering, and the abscissa of the image in the simulated layout rendering.

The information acquisition module 104 is specifically configured to: extract the URL of the image link and the descriptive text of the image from the block information; calculate the length and width of the image according to a preset algorithm priority; and obtain, according to the visual information, the ordinate and abscissa of the image in the simulated layout rendering. The highest-priority algorithm obtains the length and width of the image from the HTML attributes in the image tag; the second-priority algorithm fetches the image and obtains its length and width with image processing software; the third-priority algorithm obtains the length and width of the image from the Document Object Model (DOM) information in the browser rendering engine.

The screening module 106 is configured to obtain, according to the image information, the images that meet predetermined visual requirements (for example, images whose length is between 60 and 760, whose width is between 60 and 760, and whose aspect ratio is between 0.5 and 2.5), to further select from those images, according to the text information and the image information, an image that meets the screening rules, and to take that image as the main image of the webpage.

The predetermined visual requirements comprise: the position of the image lies within a predetermined region, and the length, width, and aspect ratio of the image meet predetermined requirements.

The specific processing of each module in the device can be understood with reference to the description in the method embodiment above and is not repeated here. The information acquisition module 104 and the screening module 106 in this embodiment together correspond to the visual information and rule base analysis module in the method embodiment.

Obviously, those skilled in the art can carry out various changes and modification and not depart from the spirit and scope of the present invention the present invention.Like this, if within of the present invention these are revised and modification belongs to the scope of the claims in the present invention and equivalent technologies thereof, the present invention is also intended to comprise these changes and modification interior.

The algorithm providing at this is intrinsic not relevant to any certain computer, virtual system or miscellaneous equipment with demonstration.Various general-purpose systems also can with based on using together with this teaching.According to description above, it is apparent constructing the desired structure of this type systematic.In addition, the present invention is not also for any certain programmed language.It should be understood that and can utilize various programming languages to realize content of the present invention described here, and the description of above language-specific being done is in order to disclose preferred forms of the present invention.

In the instructions that provided herein, a large amount of details have been described.Yet, can understand, embodiments of the invention can not put into practice in the situation that there is no these details.In some instances, be not shown specifically known method, structure and technology, so that not fuzzy understanding of this description.

Similarly, be to be understood that, in order to simplify the disclosure and to help to understand one or more in each inventive aspect, in the above in the description of exemplary embodiment of the present invention, each feature of the present invention is grouped together into single embodiment, figure or sometimes in its description.Yet, the method for the disclosure should be construed to the following intention of reflection: the present invention for required protection requires than the more feature of feature of clearly recording in each claim.Or rather, as reflected in claims below, inventive aspect is to be less than all features of disclosed single embodiment above.Therefore, claims of following embodiment are incorporated to this embodiment thus clearly, and wherein each claim itself is as independent embodiment of the present invention.

Those skilled in the art will appreciate that the modules in the client of an embodiment may be adaptively changed and arranged in one or more clients different from that embodiment. The modules of an embodiment may be combined into one module, and may additionally be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or client so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.

Furthermore, those skilled in the art will understand that, although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to fall within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the client loaded with sequenced network addresses according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order. These words may be interpreted as names.

Claims

1. A method for extracting a main image of a webpage, characterized by comprising:

obtaining the HTML text of a webpage, performing simulated typesetting display on the HTML text, and obtaining visual information of each HTML element in the webpage;

segmenting the HTML text with block information as the unit;

obtaining text information in the block information, and obtaining image information from the block information according to the visual information;

obtaining, according to the image information, images that meet a predetermined visual requirement; further selecting, according to the text information and the image information, an image that meets a screening rule from among the images that meet the predetermined visual requirement; and taking that image as the main image of the webpage.

2. the method for claim 1, is characterized in that, the html text that obtains webpage specifically comprises: the html text that obtains webpage according to the uniform resource position mark URL of webpage.

3. the method for claim 1, is characterized in that, described visual information comprises: positional information and the size information of each html element element in simulation typesetting is shown in described webpage.

4. the method for claim 1, is characterized in that, described text message comprises: non-hyperlink text length, hyperlink text length, hyperlink number, hyperlink array and picture array.

5. the method for claim 1, it is characterized in that, described pictorial information comprises: the URL of picture link, the explanatory text of picture, the width of the length of picture, picture, picture ordinate and the horizontal ordinate of picture in simulation typesetting is shown in simulation typesetting is shown.

6. The method of claim 5, characterized in that obtaining the image information specifically comprises:

extracting the URL of the image link and the descriptive text of the image from the block information;

calculating the length and the width of the image according to a preset algorithm priority;

obtaining, according to the visual information, the ordinate of the image in the simulated typesetting display and the abscissa of the image in the simulated typesetting display.

7. The method of claim 6, characterized in that calculating the length and the width of the image according to the preset algorithm priority specifically comprises at least one of the following:

the algorithm of the highest priority: obtaining the length and the width of the image from the HTML attributes in the image tag;

the algorithm of the second priority: fetching the image and obtaining the length and the width of the image by means of image-processing software;

the algorithm of the third priority: obtaining the length and the width of the image from Document Object Model (DOM) information in the browser rendering engine.
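The priority order recited in claim 7 can be illustrated with a minimal sketch. It assumes a hypothetical dictionary per image with keys `attrs`, `fetched_size`, and `dom_size`; none of these names come from the specification, and the second and third sources are stand-ins that read precomputed values rather than actually downloading the image or querying a rendering engine. Mapping "length" to the tag's `height` attribute is likewise an assumption.

```python
from typing import Callable, Optional, Sequence, Tuple

Size = Tuple[int, int]  # (length, width) in pixels; "length" is assumed to mean height
SizeSource = Callable[[dict], Optional[Size]]


def size_from_tag(img: dict) -> Optional[Size]:
    """Highest priority: read the dimensions from the image tag's attributes."""
    try:
        attrs = img["attrs"]
        return int(attrs["height"]), int(attrs["width"])
    except (KeyError, ValueError, TypeError):
        return None  # attributes absent or not plain integers


def size_from_download(img: dict) -> Optional[Size]:
    """Second priority (stand-in): size measured after fetching the image file."""
    return img.get("fetched_size")


def size_from_dom(img: dict) -> Optional[Size]:
    """Third priority (stand-in): size reported by the rendering engine's DOM."""
    return img.get("dom_size")


def resolve_size(
    img: dict,
    sources: Sequence[SizeSource] = (size_from_tag, size_from_download, size_from_dom),
) -> Optional[Size]:
    """Try each source in priority order and return the first size found."""
    for source in sources:
        size = source(img)
        if size is not None:
            return size
    return None
```

Ordering the sources in a tuple keeps the priority explicit and lets a cheaper source short-circuit the more expensive ones.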

8. the method for claim 1, is characterized in that, described predetermined vision requirement comprises: the position of described picture is positioned at predetermined region, and the length and width of described picture are big or small and Aspect Ratio meets pre-provisioning request.

9. the method for claim 1, is characterized in that, described screening rule specifically comprise following at least one:

Using the picture between webpage navigation bar or menu and long article basis as master map;

In the identical picture group sheet of size, select the first pictures as master map;

Webpage to search results pages type, chooses the first pictures as master map;

A pictures maximum in viewing area is as master map;

Calculate the explanatory text of picture and the correlativity between Web page subject, using the highest picture of correlativity as master map;

When described webpage is website homepage or special topic page, choose website logo as master map.
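Two of the screening rules in claim 9 can be sketched as follows. The dictionary keys (`length`, `width`, `caption`) are hypothetical, and the word-overlap (Jaccard) score is only a stand-in for the correlation measure, which the claim does not specify.

```python
from typing import List, Optional


def largest_area(images: List[dict]) -> Optional[dict]:
    """Rule: take the image with the largest display area."""
    return max(images, key=lambda im: im["length"] * im["width"], default=None)


def most_relevant(images: List[dict], topic: str) -> Optional[dict]:
    """Rule: take the image whose descriptive text best matches the page topic.

    Relevance here is Jaccard overlap of lowercase words, an assumed
    placeholder for whatever correlation measure an implementation uses.
    """
    topic_words = set(topic.lower().split())

    def score(im: dict) -> float:
        words = set(im.get("caption", "").lower().split())
        union = words | topic_words
        return len(words & topic_words) / len(union) if union else 0.0

    return max(images, key=score, default=None)
```

Both helpers return `None` for an empty candidate list, so a caller can fall through to another rule when one rule yields nothing.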

10. A device for extracting a main image of a webpage, characterized by comprising:

a webpage fetching module, configured to obtain the HTML text of a webpage, perform simulated typesetting display on the HTML text, and obtain visual information of each HTML element in the webpage;

an HTML parsing module, configured to segment the HTML text with block information as the unit;

an information acquisition module, configured to obtain text information in the block information, and to obtain image information from the block information according to the visual information;

a screening module, configured to obtain, according to the image information, images that meet a predetermined visual requirement, to further select, according to the text information and the image information, an image that meets a screening rule from among the images that meet the predetermined visual requirement, and to take that image as the main image of the webpage.

11. The device of claim 10, characterized in that the webpage fetching module is specifically configured to obtain the HTML text of the webpage according to the uniform resource locator (URL) of the webpage.

12. The device of claim 10, characterized in that the visual information comprises: position information and size information of each HTML element of the webpage in the simulated typesetting display.

13. The device of claim 10, characterized in that the text information comprises: non-hyperlink text length, hyperlink text length, number of hyperlinks, a hyperlink array, and an image array.

14. The device of claim 10, characterized in that the image information comprises: the URL of the image link, the descriptive text of the image, the length of the image, the width of the image, the ordinate of the image in the simulated typesetting display, and the abscissa of the image in the simulated typesetting display.

15. The device of claim 14, characterized in that the information acquisition module is specifically configured to:

16. The device of claim 15, characterized in that the algorithm of the highest priority is: obtaining the length and the width of the image from the HTML attributes in the image tag; the algorithm of the second priority is: fetching the image and obtaining the length and the width of the image by means of image-processing software; the algorithm of the third priority is: obtaining the length and the width of the image from Document Object Model (DOM) information in the browser rendering engine.

17. The device of claim 10, characterized in that the predetermined visual requirement comprises: the position of the image lies within a predetermined region, and the length, the width, and the aspect ratio of the image meet predetermined requirements.

18. The device of claim 10, characterized in that the screening rule specifically comprises at least one of the following:

for a webpage of the search-results-page type, selecting the first image as the main image;

taking the image with the largest display area as the main image;