CN103186532A - Method and device for capturing key pictures in web page - Google Patents
Method and device for capturing key pictures in web page
- Publication number
- CN103186532A
- Authority
- CN
- China
- Prior art keywords
- picture
- webpage
- pictures
- canonical
- attribute
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a method and device for capturing key pictures in a web page. The method comprises the following steps: A, acquiring the document object model (DOM) structure of the web page according to the web page address; B, locating the central node of the web page according to the DOM structure of the web page; C, regular-expression matching the pictures at the central node and its sibling nodes, filtering the matched pictures according to a preset filter condition, and outputting the pictures that meet the filter condition; and D, taking the pictures output in step C as the captured key pictures of the web page. The device comprises a corresponding DOM structure acquisition module, a node determination module, a regular-expression matching module, a filter and a key picture determination module. With the method and the device, the captured key pictures match the subject content of the web page more closely, the number of human-computer interactions is reduced, and operation is simplified.
Description
Technical field
The present invention relates to the field of Internet information processing, and in particular to a method and device for capturing key pictures in a web page.
Background
Internet content sharing functions are now available. For example, some microblog platforms provide a sharing interface that a third-party website can embed so that the content of its web pages can be shared into the microblog system, which improves the user experience. The web page content shared through current sharing interfaces mainly includes the link to the web page, a brief text summary, and the pictures in the web page. The process is as follows: after the user clicks the share button, the sharing interface captures information such as the link address, subject content and pictures of the web page, and shares this information into the target system, for example a microblog. Using the sharing interface, users can share web pages they like or find valuable with their fans, audience or friends in the microblog system, which increases traffic to those pages. Such sharing interfaces are now widely used on third-party websites.
When an existing sharing interface shares the pictures in a web page, several steps are required. First, all pictures in the web page are extracted and shown to the user, who manually clicks to select the key pictures. Second, after the user's selection is received, the pictures to be shared are confirmed. Finally, the pictures are shared into the target system (such as a microblog platform) only after the user clicks to confirm sharing.
A web page may present one or more pieces of subject content, and the pictures that illustrate (or supplement) that subject content are the key pictures, for example the pictures accompanying a news article on a news page.
However, the prior art has the following shortcomings when sharing the pictures in a web page:
It cannot capture the key pictures in a web page intelligently, so the number of human-computer interactions is too high and the operation is complicated. The pictures selected often match the subject content of the web page poorly and are not key pictures. In particular, when a web page contains a large number of pictures and icons, the key pictures cannot be found quickly and intelligently, and irrelevant pictures are often selected. Sharing pictures is therefore cumbersome for the user, and the wait during selection is long.
Summary of the invention
In view of this, the main purpose of the present invention is to provide a method and device for capturing key pictures in a web page, so as to improve how well the captured key pictures match the subject content of the web page, reduce the number of human-computer interactions, and simplify operation.
The technical solution of the present invention is implemented as follows:
A method for capturing key pictures in a web page, comprising:
A. obtaining the Document Object Model (DOM) structure of the web page according to the web page address;
B. locating the central node of the web page according to the DOM structure of the web page;
C. regex-matching the pictures at the central node and its sibling nodes, filtering the matched pictures according to a preset filter condition, and outputting the pictures that meet the filter condition;
D. taking the pictures output in step C as the captured key pictures of the web page.
A device for capturing key pictures in a web page, comprising:
a DOM structure acquisition module, configured to obtain the DOM structure of the web page according to the web page address;
a node determination module, configured to locate the central node of the web page according to the DOM structure of the web page and to input the central node to the regex matching module;
a regex matching module, configured to regex-match the pictures at the input node and its sibling nodes and to output the matched pictures to the filter;
a filter, configured to filter the input pictures according to a preset filter condition and to output the pictures that meet the filter condition;
a key picture determination module, configured to take the pictures output by the filter as the captured key pictures of the web page.
Compared with the prior art, the present invention uses the DOM structure of the web page to locate its central node, regex-matches the pictures at the central node and its sibling nodes, filters them according to a preset filter condition, and takes the filtered pictures as the key pictures of the web page. The central node and its sibling nodes match the subject content of the web page closely, and the pictures also pass through the filter condition, so the captured key pictures match the subject content better. In addition, the capture steps can be carried out entirely by computer: the user only needs to trigger the flow once manually, which reduces the number of human-computer interactions, simplifies operation, and saves the corresponding computing and bandwidth resources.
Brief description of the drawings
Fig. 1 is a flowchart of a method for capturing key pictures in a web page according to the present invention;
Fig. 2 is a schematic diagram of node weights in a web page DOM structure (also called a DOM tree);
Fig. 3 is a flowchart of a specific embodiment of the method of the present invention;
Fig. 4 is a schematic diagram of the composition of a device for capturing key pictures in a web page according to the present invention.
Detailed description
The present invention is described in further detail below with reference to the drawings and specific embodiments.
The DOM allows the content and structure of a document to be accessed and modified in a platform- and language-independent way. It is a common way of representing and processing a Hypertext Markup Language (HTML) or Extensible Markup Language (XML) document. Since current web pages are all based on HTML or XML documents, the present invention analyzes the DOM structure of the web page to find the central node that best matches the subject content.
Fig. 1 is a flowchart of a method for capturing key pictures in a web page according to the present invention. Referring to Fig. 1, the method of the present invention comprises:
The web page address is generally a Uniform Resource Locator (URL) address, which is a way of fully describing the address of web pages and other resources on the Internet. In practice the present invention is often used together with web page sharing technology: when the user shares a web page, the sharing interface obtains the URL of that page, and step 101 can use the URL obtained by the sharing interface. The DOM structure can be obtained with existing known techniques, which are not described further here.
The central node can be located according to the H tags (heading tags) in the DOM structure. The H tags indicate the weight of nodes in the web page: an H1 tag node has the highest weight, an H2 tag node the next highest, an H3 tag node the next, and so on. In this step, one or more central nodes can be located in order of weight from high to low according to the H tags; multiple nodes whose H tags have the same weight level can be ordered by their order in the page structure.
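The ordering of candidate central nodes described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes the page is available as an HTML string, uses Python's standard `html.parser`, and the names `HeadingCollector` and `rank_central_nodes` are hypothetical.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects h1..h6 elements in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.headings = []   # finished headings as (level, position, text)
        self._open = None    # heading currently being read: [level, position, parts]
        self._count = 0      # running position of headings in the page

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so "H1" arrives as "h1".
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._open = [int(tag[1]), self._count, []]
            self._count += 1

    def handle_data(self, data):
        if self._open is not None:
            self._open[2].append(data)

    def handle_endtag(self, tag):
        if self._open is not None and tag == "h%d" % self._open[0]:
            level, position, parts = self._open
            self.headings.append((level, position, "".join(parts).strip()))
            self._open = None


def rank_central_nodes(html):
    """Return headings sorted by weight (H1 first), then by page order."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return sorted(collector.headings, key=lambda h: (h[0], h[1]))
```

For a page whose headings appear in the order H2, H1, H2, H3, the H1 node is returned first, followed by the two H2 nodes in their page order, then the H3 node.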
Step 103 can be implemented in several ways, which are described in the embodiments below.
Fig. 2 is a schematic diagram of node weights in a web page DOM structure (also called a DOM tree). Referring to Fig. 2, the content of the H1 and H2 tag nodes (for reasons of space, only the H1 tag node is marked in Fig. 2) is generally the subject information of the web page (in line with W3C standards and SEO guidelines), and key pictures are usually located near an H1 or H2 tag node. That is, the closer a node is to an H1 or H2 tag node, the higher the weight of its pictures, where the distance can be determined from the path length in the DOM structure.
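One way to read "distance determined from the path length" is the number of edges between two nodes in the DOM tree. The sketch below makes that assumption explicit; it uses `xml.etree.ElementTree` on well-formed markup, and `build_parent_map` and `dom_distance` are hypothetical names, not terms from the patent.

```python
import xml.etree.ElementTree as ET


def build_parent_map(root):
    """Map every element to its parent (ElementTree stores no parent links)."""
    return {child: parent for parent in root.iter() for child in parent}


def path_to_root(node, parents):
    """Return [node, parent, grandparent, ..., root]."""
    path = [node]
    while node in parents:
        node = parents[node]
        path.append(node)
    return path


def dom_distance(a, b, parents):
    """Number of tree edges on the path between nodes a and b."""
    steps_from_b = {n: i for i, n in enumerate(path_to_root(b, parents))}
    for steps_from_a, node in enumerate(path_to_root(a, parents)):
        if node in steps_from_b:  # first shared ancestor reached from a
            return steps_from_a + steps_from_b[node]
    raise ValueError("nodes are not in the same tree")
```

Under this reading, a picture that is a sibling of the H1 node has distance 2 (up to the shared parent and down again), while a picture nested in a sidebar list several levels away has a larger distance and therefore a lower weight.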
Fig. 3 is a flowchart of a specific embodiment of the method of the present invention. This embodiment is described using the web page with the DOM structure shown in Fig. 2 as an example. Referring to Fig. 2 and Fig. 3, the flow comprises:
If the alli array is empty (that is, no picture at the H1 node and its sibling nodes was regex-matched), or the findi array is empty (that is, all pictures were filtered out by the filtering), step 308 is executed.
If the alli array is empty (that is, no picture at the parent node of the H1 node and its sibling nodes was regex-matched), or the findi array is empty (that is, all pictures were filtered out by the filtering), step 313 is executed.
In this embodiment, if no key picture is found, the regex matching and filtering are performed on only two levels: the central node and its parent node. In other embodiments, if no picture is matched at the parent node and its sibling nodes, or all matched pictures are filtered out, the parent of that parent node can be determined and the regex matching and filtering repeated. By analogy, the parent node at the next level up can be determined in turn; the number of levels can be preset as needed.
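The level-by-level fallback can be sketched as below. Several points are assumptions made for illustration only: the DOM is held as an `xml.etree.ElementTree` tree, "the pictures at a node and its sibling nodes" is taken to mean every `img` tag in the serialized markup of those nodes, and the filter is passed in as a `keep` predicate. The array names `alli` and `findi` follow the text; `match_pictures` and `capture_with_fallback` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Regular expression that pulls the src value out of each <img ...> tag.
IMG_SRC = re.compile(r'<img\b[^>]*\bsrc="([^"]+)"', re.IGNORECASE)


def match_pictures(node, parents):
    """Regex-match the pictures at a node and its sibling nodes."""
    parent = parents.get(node)
    scope = list(parent) if parent is not None else [node]
    markup = "".join(ET.tostring(n, encoding="unicode") for n in scope)
    return IMG_SRC.findall(markup)


def capture_with_fallback(center, parents, keep, max_levels=2):
    """Try the central node first, then its ancestors, up to max_levels levels."""
    node = center
    for _ in range(max_levels):
        if node is None:
            break
        alli = match_pictures(node, parents)        # all matched pictures
        findi = [src for src in alli if keep(src)]  # pictures surviving the filter
        if findi:
            return findi
        node = parents.get(node)                    # nothing found: move up one level
    return []
```

With `max_levels=2` this mirrors the two-level behaviour of the embodiment (central node, then its parent); a larger value corresponds to the further embodiments that continue upward.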
In the above steps, once the key pictures of the web page are found, they are returned to the sharing interface for the subsequent sharing operation, and the flow ends.
However, after all central nodes have been regex-matched, or after the number of central nodes that have been regex-matched reaches a preset threshold, if no picture has been matched or all pictures have been filtered out by the filtering, the following steps are further performed:
If the alli array is empty (that is, no picture at the parent node of the H1 node and its sibling nodes was regex-matched), or the findi array is empty (that is, all pictures were filtered out by the filtering), step 320 is executed.
In the above process, the filter filters the input pictures as follows:
First, format filtering is performed to select pictures of the specified formats, which in this embodiment are mainly PNG and JPG;
Second, attribute filtering is performed to select pictures that meet the specified height and width. The height and width condition can be, for example: the picture height is greater than 139px and both height and width are greater than 99px, or the picture width is greater than 139px and both height and width are greater than 99px.
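The two filtering stages can be expressed compactly. The sketch below is illustrative only: the `Picture` record and function names are hypothetical, the format check is assumed to look at the file extension of the `src` value, and the size thresholds are the example values given above.

```python
from dataclasses import dataclass


@dataclass
class Picture:
    src: str
    width: int   # pixels
    height: int  # pixels


# Formats named in the embodiment (PNG and JPG); checking by extension is an assumption.
ALLOWED_EXTENSIONS = (".png", ".jpg")


def passes_format(pic):
    """Format filtering: keep pictures whose path ends in an allowed extension."""
    path = pic.src.lower().split("?", 1)[0]  # ignore any query string
    return path.endswith(ALLOWED_EXTENSIONS)


def passes_size(pic):
    """Attribute filtering: both sides > 99px and at least one side > 139px."""
    both_over_99 = pic.width > 99 and pic.height > 99
    return both_over_99 and (pic.height > 139 or pic.width > 139)


def filter_pictures(pictures):
    """Apply format filtering, then attribute filtering."""
    return [p for p in pictures if passes_format(p) and passes_size(p)]
```

Factoring out the shared "both greater than 99px" clause gives the same result as the two-branch condition in the text, since that clause appears in both branches.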
In another embodiment, the method of filtering pictures further comprises:
The pictures selected by the format filtering and attribute filtering are weighted according to their alt and title attributes, and the picture with the highest weight is selected. According to the DOM structure of the web page, the pictures contiguous with the highest-weight picture are determined (the pictures accompanying the body text, that is, the key pictures, are generally contiguous). The format filtering and attribute filtering are applied again to these pictures, and the pictures that pass, together with the highest-weight picture, are returned in the findi array as the output of the filter.
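The text does not say how the alt and title attributes are turned into a weight. As one possible reading, offered purely as an assumption, the sketch below counts how many words of the page's subject text (for example the H1 content) appear in each attribute. The functions `tokenize`, `attribute_weight` and `pick_highest_weight` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercased set of word tokens in a string."""
    return set(re.findall(r"\w+", text.lower()))


def attribute_weight(alt, title, topic):
    """Assumed scoring: number of topic words found in alt plus those found in title."""
    topic_words = tokenize(topic)
    return len(tokenize(alt) & topic_words) + len(tokenize(title) & topic_words)


def pick_highest_weight(pictures, topic):
    """Return the picture (a dict with src, alt, title) with the highest weight.

    On a tie, the picture that comes first in the list wins.
    """
    return max(
        pictures,
        key=lambda p: attribute_weight(p.get("alt", ""), p.get("title", ""), topic),
    )
```

Any other scoring rule could be substituted for `attribute_weight` without changing the selection step; the embodiment only requires that the highest-weight picture be chosen.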
Alternatively, the method of filtering pictures may further comprise:
The picture with the largest area is selected from the pictures selected by the format filtering and attribute filtering. According to the DOM structure of the web page, the pictures contiguous with the largest-area picture are determined. The format filtering and attribute filtering are applied again to these pictures, and the pictures that pass, together with the largest-area picture, are returned in the findi array as the output of the filter.
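The area-based variant can be sketched as follows. The patent does not define "contiguous" precisely; this sketch assumes it means an unbroken run of `img` nodes next to each other in a list of sibling nodes, and it represents each node as a plain dict. The names `contiguous_run` and `select_by_area` are hypothetical.

```python
def area(pic):
    return pic["width"] * pic["height"]


def contiguous_run(siblings, index):
    """Unbroken run of img nodes around siblings[index], in document order."""
    lo = hi = index
    while lo > 0 and siblings[lo - 1]["tag"] == "img":
        lo -= 1
    while hi + 1 < len(siblings) and siblings[hi + 1]["tag"] == "img":
        hi += 1
    return siblings[lo:hi + 1]


def select_by_area(siblings, passes_filter):
    """Pick the largest filtered picture, then re-filter its contiguous neighbours."""
    candidates = [
        i for i, node in enumerate(siblings)
        if node["tag"] == "img" and passes_filter(node)
    ]
    if not candidates:
        return []
    best = max(candidates, key=lambda i: area(siblings[i]))
    # The largest picture already passed the filter, so it is kept as well.
    return [node for node in contiguous_run(siblings, best) if passes_filter(node)]
```

A picture elsewhere on the page that passes the filter but is separated from the largest picture by non-picture nodes is not returned, which reflects the idea that the key pictures of an article sit together.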
Based on the above method, the present invention also discloses a device for capturing key pictures in a web page, which can carry out the above method. Fig. 4 is a schematic diagram of the composition of this device. Referring to Fig. 4, the capture device 400 comprises:
A DOM structure acquisition module 401, configured to obtain the DOM structure of the web page according to the web page address.
A node determination module 402, configured to locate the central node of the web page according to the DOM structure of the web page and to input the central node to the regex matching module. In other embodiments, the node determination module 402 can also, according to the feedback from the regex matching module 403 and the filter, determine the parent nodes of the central node at each level and input them to the regex matching module 403, or determine the next central node and input it to the regex matching module 403; the detailed process is as described in the method above.
A regex matching module 403, configured to regex-match the pictures at the input node and its sibling nodes and to output the matched pictures to the filter; the input nodes include the central node and its parent nodes at each level.
A filter 404, configured to filter the input pictures according to a preset filter condition and to output the pictures that meet the filter condition.
A key picture determination module 405, configured to take the pictures output by the filter as the captured key pictures of the web page.
The filter specifically comprises:
A format filtering module, configured to select pictures of the specified formats (in this embodiment mainly PNG and JPG);
An attribute filtering module, configured to select pictures that meet the specified height and width.
In one specific embodiment, the filter further comprises:
A weighted selection module, configured to weight the pictures selected by the format filtering and attribute filtering according to their alt and title attributes, select the picture with the highest weight, and input it to the first reselection module;
A first reselection module, configured to determine, according to the DOM structure of the web page, the pictures contiguous with the highest-weight picture, input these pictures to the format filtering module and attribute filtering module for format filtering and attribute filtering again, and output the pictures that pass together with the highest-weight picture.
In another specific embodiment, the filter further comprises:
An area selection module, configured to select the picture with the largest area from the pictures selected by the format filtering and attribute filtering, and input it to the second reselection module;
A second reselection module, configured to determine, according to the DOM structure of the web page, the pictures contiguous with the largest-area picture, input these pictures to the format filtering module and attribute filtering module for format filtering and attribute filtering again, and output the pictures that pass together with the largest-area picture.
With the present invention, key pictures that match the subject content can be captured intelligently, with the whole operation completed in one click. This not only improves how well the captured key pictures match the subject content of the web page, but also reduces the number of human-computer operations when the user shares pictures, improves the user experience, and avoids the waste of computing and bandwidth resources caused by excessive human-computer operations.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (12)
1. A method for capturing key pictures in a web page, characterized by comprising:
A. obtaining the Document Object Model (DOM) structure of the web page according to the web page address;
B. locating the central node of the web page according to the DOM structure of the web page;
C. regex-matching the pictures at the central node and its sibling nodes, filtering the matched pictures according to a preset filter condition, and outputting the pictures that meet the filter condition;
D. taking the pictures output in step C as the captured key pictures of the web page.
2. The method according to claim 1, characterized in that, in step C, if no picture at the central node and its sibling nodes is regex-matched, or all pictures are filtered out by the filtering, the method further comprises: determining the parent node of the central node according to the DOM structure of the web page, regex-matching the pictures at the parent node and its sibling nodes, filtering the matched pictures at the parent node and its sibling nodes according to the preset filter condition, and outputting the pictures that meet the filter condition.
3. The method according to claim 2, characterized in that, in step C, if no picture at the parent node and its sibling nodes is regex-matched, or all matched pictures at the parent node and its sibling nodes are filtered out by the filtering, the method further comprises: determining the next central node according to the DOM structure of the web page, and re-executing step C.
4. The method according to claim 3, characterized in that, in step C, after all central nodes have been regex-matched, or after the number of central nodes that have been regex-matched reaches a preset threshold, if no picture has been matched or all pictures have been filtered out by the filtering, the method further comprises:
regex-matching the pictures in the overall DOM structure of the web page, filtering the matched pictures according to the preset filter condition, and outputting the pictures that meet the filter condition.
5. The method according to any one of claims 2 to 4, characterized in that the filtering of pictures specifically comprises:
performing format filtering to select pictures of the specified formats;
performing attribute filtering to select pictures that meet the specified height and width.
6. The method according to claim 5, characterized in that the filtering of pictures further comprises:
weighting the pictures selected by the format filtering and attribute filtering according to their alt and title attributes, and selecting the picture with the highest weight;
determining, according to the DOM structure of the web page, the pictures contiguous with the highest-weight picture, performing the format filtering and attribute filtering again on these pictures, and outputting the pictures that pass together with the highest-weight picture.
7. The method according to claim 5, characterized in that the filtering of pictures further comprises:
selecting the picture with the largest area from the pictures selected by the format filtering and attribute filtering;
determining, according to the DOM structure of the web page, the pictures contiguous with the largest-area picture, performing the format filtering and attribute filtering again on these pictures, and outputting the pictures that pass together with the largest-area picture.
8. The method according to claim 5, characterized in that the pictures of the specified formats in the format filtering are JPG pictures and PNG pictures.
9. A device for capturing key pictures in a web page, characterized in that the device comprises:
a DOM structure acquisition module, configured to obtain the DOM structure of the web page according to the web page address;
a node determination module, configured to locate the central node of the web page according to the DOM structure of the web page and to input the central node to the regex matching module;
a regex matching module, configured to regex-match the pictures at the input node and its sibling nodes and to output the matched pictures to the filter;
a filter, configured to filter the input pictures according to a preset filter condition and to output the pictures that meet the filter condition;
a key picture determination module, configured to take the pictures output by the filter as the captured key pictures of the web page.
10. The capture device according to claim 9, characterized in that the filter specifically comprises:
a format filtering module, configured to select pictures of the specified formats;
an attribute filtering module, configured to select pictures that meet the specified height and width.
11. The capture device according to claim 10, characterized in that the filter further comprises:
a weighted selection module, configured to weight the pictures selected by the format filtering and attribute filtering according to their alt and title attributes, select the picture with the highest weight, and input it to the first reselection module;
a first reselection module, configured to determine, according to the DOM structure of the web page, the pictures contiguous with the highest-weight picture, input these pictures to the format filtering module and attribute filtering module for format filtering and attribute filtering again, and output the pictures that pass together with the highest-weight picture.
12. The capture device according to claim 10, characterized in that the filter further comprises:
an area selection module, configured to select the picture with the largest area from the pictures selected by the format filtering and attribute filtering, and input it to the second reselection module;
a second reselection module, configured to determine, according to the DOM structure of the web page, the pictures contiguous with the largest-area picture, input these pictures to the format filtering module and attribute filtering module for format filtering and attribute filtering again, and output the pictures that pass together with the largest-area picture.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110443869.XA CN103186532B (en) | 2011-12-27 | 2011-12-27 | The grasping means of key picture and device in webpage |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110443869.XA CN103186532B (en) | 2011-12-27 | 2011-12-27 | The grasping means of key picture and device in webpage |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103186532A (en) | 2013-07-03 |
CN103186532B (en) | 2019-05-10 |
Family
ID=48677703
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110443869.XA Active CN103186532B (en) | 2011-12-27 | 2011-12-27 | The grasping means of key picture and device in webpage |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103186532B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103544271A (en) * | 2013-10-18 | 2014-01-29 | 北京奇虎科技有限公司 | Picture processing window loading method and device for browsers |
CN114817639A (en) * | 2022-05-18 | 2022-07-29 | 山东大学 | Webpage graph convolution document ordering method and system based on comparison learning |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090307219A1 (en) * | 2008-06-05 | 2009-12-10 | Bennett James D | Image search engine using image analysis and categorization |
CN102270206A (en) * | 2010-06-03 | 2011-12-07 | 北京迅捷英翔网络科技有限公司 | Method and device for capturing valid web page contents |
CN102270234A (en) * | 2011-08-01 | 2011-12-07 | 北京航空航天大学 | Image search method and search engine |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103544271A (en) * | 2013-10-18 | 2014-01-29 | 北京奇虎科技有限公司 | Picture processing window loading method and device for browsers |
CN103544271B (en) * | 2013-10-18 | 2017-03-15 | 北京奇虎科技有限公司 | Load Image in a kind of browser the method and apparatus for processing window |
CN114817639A (en) * | 2022-05-18 | 2022-07-29 | 山东大学 | Webpage graph convolution document ordering method and system based on comparison learning |
CN114817639B (en) * | 2022-05-18 | 2024-05-10 | 山东大学 | Webpage diagram convolution document ordering method and system based on contrast learning |
Also Published As
Publication number | Publication date |
---|---|
CN103186532B (en) | 2019-05-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3400540B1 (en) | Database operation using metadata of data sources | |
CN101408877B (en) | System and method for loading tree node | |
EP2321745B1 (en) | Providing posts to discussion threads in response to a search query | |
CN103365924B (en) | A kind of method of internet information search, device and terminal | |
CN111435344B (en) | Big data-based drilling acceleration influence factor analysis model | |
CN101908071B (en) | Method and device thereof for improving search efficiency of search engine | |
CN102760151B (en) | Implementation method of open source software acquisition and searching system | |
CN105760397B (en) | Internet of things ontology model processing method and device | |
DE102017111438A1 (en) | API LEARNING | |
CN105243159A (en) | Visual script editor-based distributed web crawler system | |
CN104516982A (en) | Method and system for extracting Web information based on Nutch | |
CN102521232B (en) | Distributed acquisition and processing system and method of internet metadata | |
CN102930059A (en) | Method for designing focused crawler | |
KR102222287B1 (en) | Web Crawler System for Collecting a Structured and Unstructured Data in Hidden URL | |
CN104077402A (en) | Data processing method and data processing system | |
CN101984429A (en) | Method and device for acquiring destination page, search engine and browser | |
CN106202467A (en) | A kind of definable towards peer-to-peer network searches for the web crawlers method of emphasis | |
CN105302876A (en) | Regular expression based URL filtering method | |
CN104391978A (en) | Method and device for storing and processing web pages of browsers | |
CN104112015A (en) | DOM (document object model) and XML (extensible markup language) path language based intelligent substation SCD (substation configuration description) file parsing method | |
CN102063454A (en) | Method and equipment combining search and application | |
CN110309386B (en) | Method and device for crawling web page | |
CN101894109A (en) | Database building method and device | |
CN111949619A (en) | Dynamic directory generation method, system, electronic device and storage medium | |
CN103186532A (en) | Method and device for capturing key pictures in web page |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |