CN104123363A - Method and device for extracting main image of webpage - Google Patents

Info

Publication number
CN104123363A
Authority
CN
China
Prior art keywords
picture
webpage
text
information
master map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410346226.7A
Other languages
Chinese (zh)
Other versions
CN104123363B (en)
Inventor
陈华清
许晟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd
Priority to CN201410346226.7A
Publication of CN104123363A
Application granted
Publication of CN104123363B
Active

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986: Document structures and storage, e.g. HTML extensions

Abstract

The invention discloses a method and device for extracting a main image of a webpage. The method comprises the steps that HTML text of the webpage is obtained, simulation typesetting display is conducted on the HTML text, and visual information of each HTML element in the webpage is obtained; the HTML text is segmented with block information as a unit; text information in the block information is obtained, and image information is obtained from the block information according to the visual information; images meeting preset visual requirements are obtained according to the image information, an image meeting screening rules is further selected from the images meeting the preset visual requirements according to the text information and the image information, and the image is taken as the main image of the webpage. By means of the technical scheme, quite high accuracy and efficiency can be achieved on the selection of the main image.

Description

Method and device for extracting the main image of a webpage
Technical field
The present invention relates to the field of computer technology, and in particular to a method and device for extracting the main image of a webpage.
Background art
With the development of Internet technology, HyperText Markup Language (HTML) webpages take increasingly diverse forms, and one trend is the large number of images appearing in webpages. Compared with traditional text, images have unique advantages in attracting attention and conveying meaning. For this reason, many search engines now provide in their search results, in addition to a title and a summary, a main image extracted from the webpage.
As shown in Fig. 1, in the prior art the results of search engines include more and more images, which helps users identify the information they are looking for and improves the click-through rate. Likewise, in Internet advertising, display advertisements have a clear advantage over pure text-link advertisements, because they let users see product information at a glance. Extracting the main image from a webpage is therefore important for improving the search experience and the click-through rate, and a method for extracting the main image of a webpage is urgently needed.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method and device for extracting the main image of a webpage that overcome the above problems or at least partially solve them.
The invention provides a method for extracting the main image of a webpage, comprising: obtaining the HTML text of the webpage, performing simulated typesetting display on the HTML text, and obtaining visual information of each HTML element in the webpage; segmenting the HTML text with block information as the unit; obtaining the text information in the block information, and obtaining image information from the block information according to the visual information; obtaining, according to the image information, the images that meet preset visual requirements, further selecting from those images, according to the text information and the image information, an image that meets the screening rules, and taking that image as the main image of the webpage.
Preferably, obtaining the HTML text of the webpage specifically comprises: obtaining the HTML text of the webpage according to the uniform resource locator (URL) of the webpage.
Preferably, the visual information comprises: the position information and size information of each HTML element of the webpage in the simulated typesetting display.
Preferably, the text information comprises: non-hyperlink text length, hyperlink text length, number of hyperlinks, hyperlink array and image array.
Preferably, the image information comprises: the URL of the image link, the explanatory text of the image, the length of the image, the width of the image, and the vertical and horizontal coordinates of the image in the simulated typesetting display.
Preferably, obtaining the image information specifically comprises: extracting the URL of the image link and the explanatory text of the image from the block information; calculating the length and width of the image according to preset algorithm priorities; and obtaining, according to the visual information, the vertical and horizontal coordinates of the image in the simulated typesetting display.
Preferably, calculating the length and width of the image according to preset algorithm priorities specifically comprises at least one of the following: the highest-priority algorithm obtains the length and width of the image from the HTML attributes in the image tag; the second-priority algorithm fetches the image and obtains its length and width with image-processing software; the third-priority algorithm obtains the length and width of the image from the Document Object Model (DOM) information in the browser rendering engine.
Preferably, the preset visual requirements comprise: the image is located within a predetermined region, and the length, width and aspect ratio of the image meet predetermined requirements.
Preferably, the screening rules specifically comprise at least one of the following: taking an image located between the webpage navigation bar or menu and the long body text as the main image; selecting the first image in a group of images of the same size as the main image; for a webpage of the search-results type, selecting the first image as the main image; taking the largest image in the display area as the main image; calculating the correlation between the explanatory text of each image and the webpage topic, and taking the image with the highest correlation as the main image; when the webpage is a website home page or a special-topic page, selecting the website logo as the main image.
The present invention also provides a device for extracting the main image of a webpage, comprising: a webpage fetching module, configured to obtain the HTML text of the webpage, perform simulated typesetting display on the HTML text, and obtain visual information of each HTML element in the webpage; an HTML parsing module, configured to segment the HTML text with block information as the unit; an information acquisition module, configured to obtain the text information in the block information and to obtain image information from the block information according to the visual information; and a screening module, configured to obtain, according to the image information, the images that meet preset visual requirements, to further select from those images, according to the text information and the image information, an image that meets the screening rules, and to take that image as the main image of the webpage.
Preferably, the webpage fetching module is specifically configured to obtain the HTML text of the webpage according to the uniform resource locator (URL) of the webpage.
Preferably, the visual information comprises: the position information and size information of each HTML element of the webpage in the simulated typesetting display.
Preferably, the text information comprises: non-hyperlink text length, hyperlink text length, number of hyperlinks, hyperlink array and image array.
Preferably, the image information comprises: the URL of the image link, the explanatory text of the image, the length of the image, the width of the image, and the vertical and horizontal coordinates of the image in the simulated typesetting display.
Preferably, the information acquisition module is specifically configured to: extract the URL of the image link and the explanatory text of the image from the block information; calculate the length and width of the image according to preset algorithm priorities; and obtain, according to the visual information, the vertical and horizontal coordinates of the image in the simulated typesetting display.
Preferably, the highest-priority algorithm obtains the length and width of the image from the HTML attributes in the image tag; the second-priority algorithm fetches the image and obtains its length and width with image-processing software; the third-priority algorithm obtains the length and width of the image from the Document Object Model (DOM) information in the browser rendering engine.
Preferably, the preset visual requirements comprise: the image is located within a predetermined region, and the length, width and aspect ratio of the image meet predetermined requirements.
Preferably, the screening rules specifically comprise at least one of the following: taking an image located between the webpage navigation bar or menu and the long body text as the main image; selecting the first image in a group of images of the same size as the main image; for a webpage of the search-results type, selecting the first image as the main image; taking the largest image in the display area as the main image; calculating the correlation between the explanatory text of each image and the webpage topic, and taking the image with the highest correlation as the main image; when the webpage is a website home page or a special-topic page, selecting the website logo as the main image.
Beneficial effect of the present invention is as follows:
Candidate main images of the webpage are selected by means of the image information, and the final choice within the candidate set is made according to the screening rules, so that main-image selection can reach very high accuracy. In addition, because the technical solution of the embodiments of the present invention uses the visual region for positioning, the number of candidate images that have to be evaluated is greatly reduced, which greatly increases the speed of main-image extraction.
The above description is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the present invention become more apparent, specific embodiments of the present invention are set out below.
Brief description of the drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered as limiting the present invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
Fig. 1 is a schematic diagram of a search engine results page displaying webpage main images in the prior art;
Fig. 2 is a flowchart of the webpage main image extraction method of an embodiment of the present invention;
Fig. 3 is a processing schematic diagram of the webpage main image extraction method of an embodiment of the present invention;
Fig. 4 is a schematic diagram of main image screening example 1 of an embodiment of the present invention;
Fig. 5 is a schematic diagram of main image screening example 2 of an embodiment of the present invention;
Fig. 6 is a schematic diagram of main image screening example 3 of an embodiment of the present invention;
Fig. 7 is a schematic diagram of main image screening example 4 of an embodiment of the present invention;
Fig. 8 is a schematic diagram of main image screening example 5 of an embodiment of the present invention;
Fig. 9 is a schematic diagram of main image screening example 6 of an embodiment of the present invention;
Fig. 10 is a structural schematic diagram of the webpage main image extraction device of an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure can be understood more thoroughly and its scope conveyed completely to those skilled in the art.
Existing methods for extracting the main image of a webpage include the following two approaches:
Approach one: statistics based on user behavior. This method rests on the assumption that the more often users click an image in a webpage, the more important it is. The technical solution is as follows: first count the user clicks on every image of each webpage, then select the image with the most clicks as the main image of the webpage. This solution has the following problems: 1. Low recall: not every image receives user clicks, and some images are simply not links. 2. Poor timeliness: for a newly created webpage there is no user behavior information, so no image can be extracted. 3. Low confidence: when an image has few clicks, deviations occur easily, and many small companies cannot obtain user behavior data as abundant as that of large companies. 4. User behavior bias: for example, if some images in a webpage are pictures of attractive women, they draw more attention and therefore receive more clicks.
Approach two: machine-learning classification. The technical solution is as follows: step 1, extract features of the images in the webpage, for example image size, position in the HTML and descriptive information of the image; step 2, prepare a labeled set by choosing a certain number of webpages and labeling each image in them as a main image or not; step 3, train a classification model (for example logistic regression, SVM, decision forest or GBDT) to obtain a model; step 4, use the trained model to predict whether each image in a webpage is the main image. This solution has the following problems: 1. Labeling requires a great deal of manual effort, because different types of webpages must be covered and each webpage contains many images. 2. A large number of features have to be selected, and bad cases cannot be fixed immediately. 3. All images have to be evaluated, so the amount of computation is large.
To solve the above problems in the prior art, the present invention provides a method and device for extracting the main image of a webpage that support extraction in both online and offline modes. In online mode only the webpage URL needs to be passed in: the HTML text is fetched and typeset for display by a browser rendering engine, the HTML text is parsed and organized into the data structures and organizational form required by subsequent processing, and finally the visual information and the screening rules are analyzed to obtain the main image of the webpage. The present invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention and do not limit it.
Method embodiment
According to an embodiment of the present invention, a method for extracting the main image of a webpage is provided. Fig. 2 is a flowchart of the webpage main image extraction method of the embodiment of the present invention. As shown in Fig. 2, the method comprises the following processing:
S210: obtain the HTML text of the webpage, perform simulated typesetting display on the HTML text, and obtain visual information of each HTML element in the webpage. In the embodiment of the present invention, the visual information comprises the position information and size information of each HTML element of the webpage in the simulated typesetting display.
The embodiment of the present invention supports extracting the main image in both online and offline modes. In offline mode, the HTML text of the webpage has to be available already; in online mode, the webpage can be fetched by its URL and its HTML text obtained online.
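The two acquisition modes can be sketched as follows. This is a minimal illustration only, assuming that offline input is a saved HTML file and that online input is an http(s) URL; the function name load_html and its parameters are hypothetical, and the sketch returns raw HTML text without performing the simulated typesetting display described in S210.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def load_html(source: str, timeout: float = 10.0) -> str:
    """Return HTML text for a URL (online mode) or a local file path (offline mode)."""
    scheme = urlparse(source).scheme
    if scheme in ("http", "https"):
        # Online mode: fetch the page by its URL.
        with urlopen(source, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    # Offline mode: the HTML text was saved beforehand (assumed UTF-8 here).
    return Path(source).read_text(encoding="utf-8", errors="replace")
```

Obtaining element positions and sizes still requires a rendering engine, so a function like this would only cover the first half of S210.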
S220: segment the HTML text with block information as the unit. It should be noted that block information here refers to an HTML fragment formed by tags such as <DIV> and <TABLE>.
S230: obtain the text information in the block information, and obtain image information from the block information according to the visual information. The text information may comprise: non-hyperlink text length, hyperlink text length, number of hyperlinks, hyperlink array and image array. The image information comprises: the URL of the image link, the explanatory text of the image, the length of the image, the width of the image, and the vertical and horizontal coordinates of the image in the simulated typesetting display.
In other words, the image information obtained from the block information according to the visual information in S230 can be regarded as a processed, more detailed form of visual information.
In S230, obtaining the image information specifically comprises:
Step 1: extract the URL of the image link and the explanatory text of the image from the block information;
Step 2: calculate the length and width of the image according to preset algorithm priorities. Specifically, this comprises at least one of the following: the highest-priority algorithm obtains the length and width of the image from the HTML attributes in the image tag; the second-priority algorithm fetches the image and obtains its length and width with image-processing software; the third-priority algorithm obtains the length and width of the image from the Document Object Model (DOM) information in the browser rendering engine.
Step 3: obtain, according to the visual information, the vertical and horizontal coordinates of the image in the simulated typesetting display.
S240: obtain, according to the image information, the images that meet the preset visual requirements (for example, images whose length is between 60 and 760, whose width is between 60 and 760 and whose aspect ratio is between 0.5 and 2.5), further select from those images, according to the text information and the image information, an image that meets the screening rules, and take that image as the main image of the webpage.
In S240, the preset visual requirements comprise: the image is located within a predetermined region, and the length, width and aspect ratio of the image meet predetermined requirements.
The screening rules specifically comprise at least one of the following: taking an image located between the webpage navigation bar or menu and the long body text as the main image; selecting the first image in a group of images of the same size as the main image; for a webpage of the search-results type, selecting the first image as the main image; taking the largest image in the display area as the main image; calculating the correlation between the explanatory text of each image and the webpage topic, and taking the image with the highest correlation as the main image; when the webpage is a website home page or a special-topic page, selecting the website logo as the main image.
The above technical solution of the embodiment of the present invention is described in further detail below with reference to examples and the drawings.
Fig. 3 is a processing schematic diagram of the webpage main image extraction method of the embodiment of the present invention. As shown in Fig. 3, in online mode only the webpage URL needs to be passed in. The webpage fetching module fetches the page and typesets it for display with a browser rendering engine; the HTML parsing module then parses the result and organizes it into the data structures and organizational form required by the downstream modules; finally, the visual information and rule base analysis module performs the analysis and obtains the main image of the webpage. Each processing stage of the method is described in detail below:
Webpage fetching module: unlike traditional fetching modules based on cURL, Wget or the HTTP protocol, this module does not simply obtain the HTML text. It needs to obtain two kinds of information: first, the HTML text; second, the result of typesetting and displaying the HTML text while simulating the behavior of a browser with JavaScript support, so as to obtain the display position and size of each HTML element in the browser (that is, the visual information).
In the embodiment of the present invention, the typesetting display of the webpage fetching module can be implemented with PhantomJS. PhantomJS is a browser rendering engine based on the WebKit kernel with full JavaScript parsing and page rendering capabilities, and it can be used to simulate the various events a modern browser handles when loading a webpage.
In addition, in the embodiment of the present invention, the visual information obtained by the webpage fetching module can be acquired by accessing the DOM structure of the HTML through JavaScript:
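// Sum the offsets along the offsetParent chain to get the image position relative to the page
var actualLeft = images[i].offsetLeft;
var actualTop = images[i].offsetTop;
var current = images[i].offsetParent;
while (current !== null) {
    actualLeft += current.offsetLeft;
    actualTop += current.offsetTop;
    current = current.offsetParent;
}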
HTML parsing module: this module parses the HTML, using a finite state machine to segment the HTML text by block information. The main purpose is to organize the webpage into a structured form, which is the foundation of subsequent processing.
For example, an HTML fragment is parsed into a data structure in which the block information consists mainly of the text and the hyperlinks within the block (the example fragment and the resulting data structure are not reproduced in this text).
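Segmentation by block information can be sketched as follows. This is a hedged illustration, not the parser of the embodiment: it relies on Python's standard html.parser module rather than a hand-written finite state machine, treats only <div> and <table> as block tags, attributes text to the innermost open block, and uses hypothetical names (BlockSplitter and the dictionary keys).

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table"}  # block-forming tags named in the text


class BlockSplitter(HTMLParser):
    """Collect per-block text lengths, hyperlinks and images."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished blocks, in the order they were closed
        self._stack = []    # blocks that are currently open
        self._in_link = 0   # nesting depth of <a> elements

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in BLOCK_TAGS:
            self._stack.append({"tag": tag, "text_len": 0, "link_text_len": 0,
                                "links": [], "images": []})
        elif tag == "a":
            self._in_link += 1
            if self._stack:
                self._stack[-1]["links"].append(attrs.get("href") or "")
        elif tag == "img" and self._stack:
            self._stack[-1]["images"].append(
                {"src": attrs.get("src") or "", "alt": attrs.get("alt") or ""})

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS and self._stack:
            # Simplification: assumes well-nested markup and closes the innermost block.
            self.blocks.append(self._stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            key = "link_text_len" if self._in_link else "text_len"
            self._stack[-1][key] += len(text)
```

Feeding a fragment with splitter.feed(html) fills splitter.blocks with one dictionary per closed block, which roughly corresponds to the text information listed in S230.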
Visual information and rule base analysis module: first, the block information passed in by the HTML parsing module is processed, and the following two data structures (the text information and image information described above) are obtained from it:
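class TextBlock:
    def __init__(self):
        pass  # field definitions are not reproduced in the source text

class ImageBlock:
    pass  # field definitions are not reproduced in the source text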
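Since the field definitions are not reproduced above, the following sketch shows one possible shape for the two structures, built only from the fields listed for the text information and the image information in S230. The attribute names and default values are assumptions chosen for illustration, and the source does not state which of "length" and "width" is the horizontal dimension.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ImageBlock:
    url: str = ""        # URL of the image link
    alt_text: str = ""   # explanatory text of the image
    length: int = 0      # image length, in pixels
    width: int = 0       # image width, in pixels
    x: int = 0           # horizontal coordinate in the simulated typesetting display
    y: int = 0           # vertical coordinate in the simulated typesetting display


@dataclass
class TextBlock:
    text_length: int = 0        # non-hyperlink text length
    link_text_length: int = 0   # hyperlink text length
    link_count: int = 0         # number of hyperlinks
    links: List[str] = field(default_factory=list)         # hyperlink array
    images: List[ImageBlock] = field(default_factory=list)  # image array
```

Using default_factory gives every TextBlock its own lists, so appending an image to one block does not affect another.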
In the embodiment of the present invention, the length and width of an image are calculated by three methods in order of priority (that is, if one method fails to produce a result, the next one is used):
1. Obtain the length and width of the image from the HTML attributes in the image tag;
2. Fetch the image and obtain its length and width with ImageMagick;
3. Obtain the length and width of the image from the DOM information in PhantomJS.
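The priority order above amounts to a fallback chain, which might look like the sketch below. Only the first method is implemented; the second and third are stand-ins that read precomputed values from hypothetical dictionary keys ("downloaded_size" and "dom_size"), because the actual ImageMagick and PhantomJS calls are not given in the text. Mapping the image tag's width and height attributes to the two dimensions is likewise an assumption.

```python
from typing import Callable, Optional, Sequence, Tuple

Size = Tuple[int, int]
SizeSource = Callable[[dict], Optional[Size]]


def size_from_attributes(image: dict) -> Optional[Size]:
    """Priority 1: read the size from the attributes of the image tag."""
    attrs = image.get("attrs", {})
    try:
        size = (int(attrs["width"]), int(attrs["height"]))
    except (KeyError, TypeError, ValueError):
        return None  # attribute missing or not a plain number (e.g. "50%")
    return size if min(size) > 0 else None


def size_from_download(image: dict) -> Optional[Size]:
    """Priority 2 stand-in: size measured after fetching the image."""
    return image.get("downloaded_size")


def size_from_dom(image: dict) -> Optional[Size]:
    """Priority 3 stand-in: size reported by the rendering engine's DOM."""
    return image.get("dom_size")


def first_available_size(image: dict, sources: Sequence[SizeSource]) -> Optional[Size]:
    """Try each source in priority order and return the first result."""
    for source in sources:
        size = source(image)
        if size is not None:
            return size
    return None
```

Passing the sources as a list keeps the priority order explicit and lets the cheapest method run first, so an image is only downloaded when its tag carries no usable size.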
The visual information and rule base analysis module also needs to select images on a visual basis. Specifically, as shown in Fig. 4, common experience suggests that a webpage author generally places the main image in the visual foreground of the page, that is, where people see it most easily. The bottom of the page, positions that can only be seen after scrolling, and the corners of the page are seldom used for the main image.
Statistics on the visual regions in which the main images of a large number of webpages are located show that most main images lie in the rectangular region whose horizontal coordinate is between 0 and 700 and whose vertical coordinate is between 20 and 1000 (unit: pixels). Using this visual information, all images in this region that meet certain size requirements can be put into the candidate set. For example, the image length is between 60 and 760, the width is between 60 and 760, and the aspect ratio is between 0.5 and 2.5. Images that do not meet these conditions mostly fail the shape and size requirements of a main image and are generally icons, advertising banners and the like.
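The candidate filter described above can be sketched with the example thresholds from the text. Several points are assumptions rather than statements of the source: the coordinates are taken to be those of the top-left corner of the image, the aspect ratio is computed as length divided by width, and all bounds are treated as inclusive. The names Candidate and meets_visual_requirements are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    x: int       # horizontal coordinate, in pixels
    y: int       # vertical coordinate, in pixels
    length: int  # image length, in pixels
    width: int   # image width, in pixels


def meets_visual_requirements(img: Candidate) -> bool:
    """Check the example region, size and aspect-ratio thresholds."""
    in_region = 0 <= img.x <= 700 and 20 <= img.y <= 1000
    size_ok = 60 <= img.length <= 760 and 60 <= img.width <= 760
    if not (in_region and size_ok):
        return False
    ratio = img.length / img.width  # width is at least 60 here, so no division by zero
    return 0.5 <= ratio <= 2.5
```

A candidate set would then be built with a list comprehension such as [img for img in images if meets_visual_requirements(img)]; small icons and wide banners drop out at this stage.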
Finally, the visual information and rule base analysis module also needs to screen the images further on the basis of rules:
Rule one: as shown in Fig. 5, an image located between the webpage navigation bar (or menu) and the long body text is generally the main image;
Rule two: as shown in Fig. 6, in a group of images of the same size, the first image is chosen as the main image;
Rule three: as shown in Fig. 7, for a webpage of the search-results type, the first image is chosen as the main image;
Rule four: as shown in Fig. 8, the largest image in the display area is chosen as the main image;
Rule five: the correlation between the descriptive information of each image and the webpage TITLE is calculated, and the image with the higher correlation is chosen as the main image. The main image is generally highly relevant to the webpage, and the TITLE of the webpage expresses the content of the webpage in concentrated form, so if the descriptive information of an image is highly correlated with the webpage TITLE, the image can be regarded as the main image;
Rule six: as shown in Fig. 9, if none of the above rules finds a qualifying main image and the webpage is a website home page or a special-topic page, the website LOGO on the webpage is chosen as the main image.
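The text does not specify how the correlation in rule five is measured. As one simple possibility, the sketch below scores each image by the Jaccard overlap between the word sets of its descriptive text and the webpage TITLE. This measure is a substitute chosen for illustration; splitting on word characters suits space-delimited languages, and Chinese text would need a proper word segmenter. All function names are hypothetical.

```python
import re
from typing import Optional, Sequence, Tuple


def tokens(text: str) -> set:
    """Lower-cased word tokens of a string."""
    return set(re.findall(r"\w+", text.lower()))


def relevance(alt_text: str, title: str) -> float:
    """Jaccard overlap between the image description and the page title."""
    a, b = tokens(alt_text), tokens(title)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_relevant(images: Sequence[Tuple[str, str]], title: str) -> Optional[str]:
    """images holds (url, alt_text) pairs; return the URL of the best match, if any."""
    best = max(images, key=lambda item: relevance(item[1], title), default=None)
    if best is None or relevance(best[1], title) == 0.0:
        return None
    return best[0]
```

Returning None when no image shares a word with the title lets a caller fall through to another rule instead of picking an unrelated image.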
It should be noted that one or more of these rules may be applied as needed, and that they may be executed in any order. Preferably, in a specific embodiment, all six rules are applied and executed in the order given above.
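Applying the rules in sequence can be sketched as a chain of functions in which the first rule that yields an image decides the result. That stopping behavior is one reading of "executed in the order given above" and is an assumption; only rules two and four are shown, and the dictionary keys ("length", "width") and function names are hypothetical.

```python
from typing import Callable, Optional, Sequence

Image = dict
Rule = Callable[[Sequence[Image]], Optional[Image]]


def first_of_same_size_group(candidates: Sequence[Image]) -> Optional[Image]:
    """Rule two: the first image of a group of images sharing one size."""
    groups = {}
    for img in candidates:
        groups.setdefault((img["length"], img["width"]), []).append(img)
    for group in groups.values():  # dicts keep insertion order
        if len(group) > 1:
            return group[0]
    return None


def largest_area(candidates: Sequence[Image]) -> Optional[Image]:
    """Rule four: the image with the largest area."""
    return max(candidates, key=lambda img: img["length"] * img["width"], default=None)


def select_main_image(candidates: Sequence[Image], rules: Sequence[Rule]) -> Optional[Image]:
    """Run the rules in order and return the first image any rule selects."""
    for rule in rules:
        chosen = rule(candidates)
        if chosen is not None:
            return chosen
    return None
```

Because each rule is an ordinary function, the set of rules and their order can be changed by editing the list passed to select_main_image.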
In summary, the technical solution of the embodiment of the present invention does not rely on user behavior and extracts the main image from a single page, so it has no cold-start problem and adapts well. In addition, it uses visual information and several other rules to simulate how people recognize a main image, and therefore achieves high accuracy. Moreover, because the visual region is used for positioning, the number of candidate images that have to be evaluated is greatly reduced, which greatly increases the speed of main-image extraction. The technical solution of the embodiment of the present invention solves the prior-art problem of extracting the main image of a webpage, so that the main image can be used on search result pages, where it forms a richer presentation together with the title and summary of the webpage. It also enriches the presentation of advertising creatives, replacing text-only links, and can improve the click-through rate of advertisements.
Device embodiment
According to an embodiment of the present invention, a device for extracting the main image of a webpage is provided. Fig. 10 is a structural schematic diagram of the webpage main image extraction device of the embodiment of the present invention. As shown in Fig. 10, the device comprises: a webpage fetching module 100, an HTML parsing module 102, an information acquisition module 104 and a screening module 106. Each module of the embodiment of the present invention is described in detail below.
The webpage fetching module 100 is configured to obtain the HTML text of the webpage, perform simulated typesetting display on the HTML text, and obtain visual information of each HTML element in the webpage. In the embodiment of the present invention, the visual information comprises the position information and size information of each HTML element of the webpage in the simulated typesetting display.
The embodiment of the present invention supports extracting the main image in both online and offline modes. In offline mode, the webpage fetching module 100 can obtain the HTML text of the webpage directly; in online mode, the webpage fetching module 100 can fetch the webpage by its URL and obtain its HTML text online.
The HTML parsing module 102 is configured to segment the HTML text with block information as the unit. It should be noted that block information here refers to an HTML fragment formed by tags such as <DIV> and <TABLE>.
The information acquisition module 104 is configured to obtain the text information in the block information and to obtain image information from the block information according to the visual information. The text information may comprise: non-hyperlink text length, hyperlink text length, number of hyperlinks, hyperlink array and image array. The image information comprises: the URL of the image link, the explanatory text of the image, the length of the image, the width of the image, and the vertical and horizontal coordinates of the image in the simulated typesetting display.
The information acquisition module 104 is specifically configured to: extract the URL of the image link and the explanatory text of the image from the block information; calculate the length and width of the image according to preset algorithm priorities; and obtain, according to the visual information, the vertical and horizontal coordinates of the image in the simulated typesetting display. The highest-priority algorithm obtains the length and width of the image from the HTML attributes in the image tag; the second-priority algorithm fetches the image and obtains its length and width with image-processing software; the third-priority algorithm obtains the length and width of the image from the Document Object Model (DOM) information in the browser rendering engine.
Screening module 106, configured to obtain, according to the picture information, the pictures that meet a predetermined visual requirement (for example, a length between 60 and 760, a width between 60 and 760, and an aspect ratio between 0.5 and 2.5), to further select from those pictures, according to the text information and the picture information, a picture that satisfies the screening rules, and to use that picture as the main image of the webpage.
The predetermined visual requirement includes: the picture is located within a predetermined region, and the length, width, and aspect ratio of the picture meet predetermined requirements.
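The visual requirement amounts to a simple predicate over a picture's size and position. The sketch below uses the example thresholds from the description (60 to 760 for each side, 0.5 to 2.5 for the aspect ratio); the default region and the reading of the aspect ratio as length divided by width are assumptions, since the description does not specify either.

```python
MIN_SIDE, MAX_SIDE = 60, 760      # example size bounds from the description
MIN_RATIO, MAX_RATIO = 0.5, 2.5   # example aspect-ratio bounds from the description
DEFAULT_REGION = (0, 0, 1024, 2000)  # hypothetical (x_min, y_min, x_max, y_max)


def meets_visual_requirement(length, width, x, y, region=DEFAULT_REGION):
    """Return True if the picture lies in the region and its size and ratio are in range."""
    x_min, y_min, x_max, y_max = region
    in_region = x_min <= x <= x_max and y_min <= y <= y_max
    size_ok = MIN_SIDE <= length <= MAX_SIDE and MIN_SIDE <= width <= MAX_SIDE
    if not (in_region and size_ok):
        return False
    # Assumption: aspect ratio is length divided by width.
    return MIN_RATIO <= length / width <= MAX_RATIO
```

Because this check needs only numbers already collected during layout simulation, it can discard most candidates before any of the more expensive screening rules run.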
The screening rules specifically include at least one of the following: using the picture located between the webpage navigation bar or menu and the long body text as the main image; among a group of pictures of the same size, selecting the first picture as the main image; for a webpage of the search-results type, selecting the first picture as the main image; using the picture with the largest display area as the main image; calculating the relevance between the descriptive text of each picture and the topic of the webpage, and using the picture with the highest relevance as the main image; and, when the webpage is a website home page or a special-topic page, selecting the website logo as the main image.
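Two of these rules, largest display area and highest relevance to the page topic, can be sketched as follows. The description does not say how relevance is computed; the Jaccard overlap of word sets used here is only a stand-in, and the dictionary keys (`caption`, `length`, `width`) are hypothetical.

```python
import re


def tokens(text):
    """Lower-cased word set; a real system would need proper segmentation for Chinese text."""
    return set(re.findall(r"\w+", text.lower()))


def relevance(caption, topic):
    """Stand-in relevance score: Jaccard overlap between caption and topic words."""
    a, b = tokens(caption), tokens(topic)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_relevant(pictures, topic):
    """Rule: choose the picture whose descriptive text best matches the page topic."""
    return max(pictures, key=lambda p: relevance(p["caption"], topic))


def largest(pictures):
    """Rule: choose the picture with the largest display area."""
    return max(pictures, key=lambda p: p["length"] * p["width"])
```

The rules can disagree, as a large banner may be unrelated to the topic, so an implementation needs some policy for ordering or combining them; the description only requires that at least one rule be applied.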
For the specific processing performed by each module of the webpage main image extraction device of this embodiment, refer to the description of the method embodiment above; it is not repeated here. The information acquisition module 104 and the screening module 106 of this embodiment correspond to the visual information and rule base analysis module of the method embodiment.
In summary, the technical solution of the embodiments of the present invention does not rely on user behavior and extracts the main image from a single page, so it has no cold-start problem and adapts well. In addition, it applies multiple rules, including visual information, to simulate how people recognize a main image, and therefore achieves higher accuracy. Moreover, because visual regions are used for positioning, the number of candidate pictures that need to be evaluated is greatly reduced, which greatly increases the extraction speed. The technical solution solves the prior-art problem of webpage main image extraction, so that the main image can be shown on search result pages and, together with the page title and summary, form a richer presentation. It also enriches the presentation of ad creatives beyond plain text links and can improve the ad click-through rate.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Furthermore, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in various programming languages, and that the description of a specific language above is given to disclose the best mode of the invention.
Numerous specific details are set forth in the description provided herein. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the client of an embodiment can be adaptively changed and arranged in one or more clients different from that embodiment. The modules in an embodiment may be combined into one module, and may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or client so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
In addition, those skilled in the art will understand that, although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the client loaded with sorted web addresses according to the embodiments of the present invention. The present invention may also be implemented as device or apparatus programs (for example, computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the present invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.

Claims (18)

1. A webpage main image extraction method, characterized by comprising:
Obtaining the HTML text of a webpage, performing simulated layout rendering on the HTML text, and obtaining visual information of each HTML element in the webpage;
Segmenting the HTML text into block units;
Obtaining text information from the blocks, and obtaining picture information from the blocks according to the visual information;
Obtaining, according to the picture information, pictures that meet a predetermined visual requirement, further selecting from the pictures that meet the predetermined visual requirement, according to the text information and the picture information, a picture that satisfies a screening rule, and using that picture as the main image of the webpage.
2. the method for claim 1, is characterized in that, the html text that obtains webpage specifically comprises: the html text that obtains webpage according to the uniform resource position mark URL of webpage.
3. the method for claim 1, is characterized in that, described visual information comprises: positional information and the size information of each html element element in simulation typesetting is shown in described webpage.
4. the method for claim 1, is characterized in that, described text message comprises: non-hyperlink text length, hyperlink text length, hyperlink number, hyperlink array and picture array.
5. the method for claim 1, it is characterized in that, described pictorial information comprises: the URL of picture link, the explanatory text of picture, the width of the length of picture, picture, picture ordinate and the horizontal ordinate of picture in simulation typesetting is shown in simulation typesetting is shown.
6. The method of claim 5, characterized in that obtaining the picture information specifically comprises:
Extracting the URL of the picture link and the descriptive text of the picture from the blocks;
Calculating the length and width of the picture according to a preset algorithm priority;
Obtaining, according to the visual information, the vertical and horizontal coordinates of the picture in the simulated layout rendering.
7. The method of claim 6, characterized in that calculating the length and width of the picture according to the preset algorithm priority specifically comprises at least one of the following:
The highest-priority algorithm: obtaining the length and width of the picture from the HTML attributes in the image tag;
The second-priority algorithm: fetching the picture and obtaining its length and width with graphics software;
The third-priority algorithm: obtaining the length and width of the picture from the Document Object Model (DOM) information in the browser rendering engine.
8. the method for claim 1, is characterized in that, described predetermined vision requirement comprises: the position of described picture is positioned at predetermined region, and the length and width of described picture are big or small and Aspect Ratio meets pre-provisioning request.
9. the method for claim 1, is characterized in that, described screening rule specifically comprise following at least one:
Using the picture between webpage navigation bar or menu and long article basis as master map;
In the identical picture group sheet of size, select the first pictures as master map;
Webpage to search results pages type, chooses the first pictures as master map;
A pictures maximum in viewing area is as master map;
Calculate the explanatory text of picture and the correlativity between Web page subject, using the highest picture of correlativity as master map;
When described webpage is website homepage or special topic page, choose website logo as master map.
10. A webpage main image extraction device, characterized by comprising:
A webpage fetching module, configured to obtain the HTML text of a webpage, perform simulated layout rendering on the HTML text, and obtain visual information of each HTML element in the webpage;
An HTML parsing module, configured to segment the HTML text into block units;
An information acquisition module, configured to obtain text information from the blocks and to obtain picture information from the blocks according to the visual information;
A screening module, configured to obtain, according to the picture information, pictures that meet a predetermined visual requirement, to further select from the pictures that meet the predetermined visual requirement, according to the text information and the picture information, a picture that satisfies a screening rule, and to use that picture as the main image of the webpage.
11. The device of claim 10, characterized in that the webpage fetching module is specifically configured to obtain the HTML text of the webpage according to the uniform resource locator (URL) of the webpage.
12. The device of claim 10, characterized in that the visual information comprises: position information and size information of each HTML element of the webpage in the simulated layout rendering.
13. The device of claim 10, characterized in that the text information comprises: non-hyperlink text length, hyperlink text length, number of hyperlinks, hyperlink array, and picture array.
14. The device of claim 10, characterized in that the picture information comprises: the URL of the picture link, the descriptive text of the picture, the length of the picture, the width of the picture, and the vertical and horizontal coordinates of the picture in the simulated layout rendering.
15. The device of claim 14, characterized in that the information acquisition module is specifically configured to:
Extract the URL of the picture link and the descriptive text of the picture from the blocks;
Calculate the length and width of the picture according to a preset algorithm priority;
Obtain, according to the visual information, the vertical and horizontal coordinates of the picture in the simulated layout rendering.
16. The device of claim 15, characterized in that the highest-priority algorithm obtains the length and width of the picture from the HTML attributes in the image tag; the second-priority algorithm fetches the picture and obtains its length and width with graphics software; and the third-priority algorithm obtains the length and width of the picture from the Document Object Model (DOM) information in the browser rendering engine.
17. The device of claim 10, characterized in that the predetermined visual requirement comprises: the picture is located within a predetermined region, and the length, width, and aspect ratio of the picture meet predetermined requirements.
18. The device of claim 10, characterized in that the screening rule specifically comprises at least one of the following:
Using the picture located between the webpage navigation bar or menu and the long body text as the main image;
Among a group of pictures of the same size, selecting the first picture as the main image;
For a webpage of the search-results type, selecting the first picture as the main image;
Using the picture with the largest display area as the main image;
Calculating the relevance between the descriptive text of each picture and the topic of the webpage, and using the picture with the highest relevance as the main image;
When the webpage is a website home page or a special-topic page, selecting the website logo as the main image.
CN201410346226.7A 2014-07-21 2014-07-21 Method and device for extracting main image of webpage Active CN104123363B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410346226.7A CN104123363B (en) 2014-07-21 2014-07-21 Method and device for extracting main image of webpage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410346226.7A CN104123363B (en) 2014-07-21 2014-07-21 Method and device for extracting main image of webpage

Publications (2)

Publication Number Publication Date
CN104123363A true CN104123363A (en) 2014-10-29
CN104123363B CN104123363B (en) 2018-07-13

Family

ID=51768774

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410346226.7A Active CN104123363B (en) 2014-07-21 2014-07-21 Method and device for extracting main image of webpage

Country Status (1)

Country Link
CN (1) CN104123363B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101944109A (en) * 2010-09-06 2011-01-12 华南理工大学 System and method for extracting picture abstract based on page partitioning
US20120117449A1 (en) * 2010-11-08 2012-05-10 Microsoft Corporation Creating and Modifying an Image Wiki Page
CN103425644A (en) * 2012-05-14 2013-12-04 腾讯科技(深圳)有限公司 Method and device for extracting pictures in webpage content
CN103885959A (en) * 2012-12-20 2014-06-25 腾讯科技(深圳)有限公司 Webpage bookmark generating method and webpage bookmark generating device

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376114B (en) * 2014-12-01 2018-01-30 百度在线网络技术(北京)有限公司 A kind of search result methods of exhibiting and device
CN104376114A (en) * 2014-12-01 2015-02-25 百度在线网络技术(北京)有限公司 Search result displaying method and device
CN104699837A (en) * 2015-03-31 2015-06-10 北京奇虎科技有限公司 Method, device and server for selecting illustrated pictures of web pages
CN104881428A (en) * 2015-04-02 2015-09-02 广州神马移动信息科技有限公司 Information graph extracting and retrieving method and device for information graph webpages
CN106445997A (en) * 2016-07-20 2017-02-22 腾讯科技(北京)有限公司 Information processing method and server
CN106445997B (en) * 2016-07-20 2021-02-05 腾讯科技(北京)有限公司 Information processing method and server
CN106503059B (en) * 2016-09-27 2019-07-23 北京小米移动软件有限公司 Displayed page method for pushing and device
CN106503059A (en) * 2016-09-27 2017-03-15 北京小米移动软件有限公司 Displayed page method for pushing and device
CN106547540A (en) * 2016-10-12 2017-03-29 惠州市德赛西威汽车电子股份有限公司 A kind of method for drafting of text button
CN106484913A (en) * 2016-10-26 2017-03-08 腾讯科技(深圳)有限公司 Method and server that a kind of Target Photo determines
CN106484913B (en) * 2016-10-26 2021-09-07 腾讯科技(深圳)有限公司 Target picture determining method and server
WO2018120575A1 (en) * 2016-12-30 2018-07-05 百度在线网络技术(北京)有限公司 Method and device for identifying main picture in web page
CN108268488A (en) * 2016-12-30 2018-07-10 百度在线网络技术(北京)有限公司 The recognition methods of webpage master map and device
US10963690B2 (en) 2016-12-30 2021-03-30 Baidu Online Network Technology (Beijing) Co., Ltd. Method for identifying main picture in web page
CN108399167A (en) * 2017-02-04 2018-08-14 百度在线网络技术(北京)有限公司 Webpage information extracting method and device
CN107066596A (en) * 2017-04-19 2017-08-18 北京小米移动软件有限公司 The method and apparatus for generating link information
CN107766475A (en) * 2017-10-09 2018-03-06 李亚强 A kind of system of selection of info web master map and device
CN109685085A (en) * 2017-10-18 2019-04-26 阿里巴巴集团控股有限公司 A kind of master map extracting method and device
CN109685085B (en) * 2017-10-18 2023-09-26 阿里巴巴集团控股有限公司 Main graph extraction method and device
CN112084451A (en) * 2020-09-16 2020-12-15 哈尔滨工业大学 Webpage LOGO extraction system and method based on visual blocking
CN112597765A (en) * 2020-12-25 2021-04-02 四川长虹电器股份有限公司 Automatic movie and television topic generation method based on multi-mode features
CN116578763A (en) * 2023-07-11 2023-08-11 卓谨信息科技(常州)有限公司 Multisource information exhibition system based on generated AI cognitive model
CN116578763B (en) * 2023-07-11 2023-09-15 卓谨信息科技(常州)有限公司 Multisource information exhibition system based on generated AI cognitive model

Also Published As

Publication number Publication date
CN104123363B (en) 2018-07-13

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220726

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.