CN103399885B - Mining method and device of POI (point of interest) representing images and server - Google Patents

Info

Publication number
CN103399885B
Authority
CN
China
Prior art keywords
website
interest
point
representative picture
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310306642.XA
Other languages
Chinese (zh)
Other versions
CN103399885A (en)
Inventor
孙明芳
牛正雨
刘峰
吴璞
吴一璞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201310306642.XA
Publication of CN103399885A
Application granted
Publication of CN103399885B
Status: Active
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a device for mining representing images of a POI (point of interest), and a server. The method includes: acquiring a physical site corresponding to a POI from the Internet according to the name and address of the POI; searching for at least one representing page of the physical site according to anchor text on the front page of the physical site, each representing page including an introduction to the physical site; reading the representing pages and acquiring a set of representing images; and acquiring, from the set of representing images, at least one image that best meets a predetermined feature and using it as a representing image of the POI. The method and device allow representing images to be mined automatically from the physical site of the POI; the acquired images are clearer and better meet users' need to get to know the POI.

Description

Method, device and server for mining representative pictures of points of interest
Technical field
The present invention relates to the field of network communication, and more particularly to a method, a device and a server for mining representative pictures of points of interest.
Background art
With the rapid development of the mobile Internet, location-based services (LBS) are being accepted by more and more users. The point of interest (POI) is a key concept in location-based services. A point of interest represents a location entity on an electronic map; such location entities may be factories, schools, shops, parks and so on. Point-of-interest data generally includes information such as the name, address, telephone number and position coordinates of the location entity. In some location-based service applications, point-of-interest data also includes a representative picture. A representative picture typically shows the general appearance of the point of interest, giving users a more intuitive understanding of the point of interest they have retrieved.
In the prior art, representative pictures of points of interest are generally obtained from vertical websites. A vertical website is a website that addresses users' particular needs in a specific field; compared with a comprehensive website, the services it provides are more specialized. However, representative pictures obtained from vertical websites are often unclear and often carry watermarks. With the rapid spread of the Internet, more and more business entities such as merchants, enterprises and public institutions have begun to run their own entity websites. Yet at present there is still no solution for obtaining the representative picture of a point of interest from an entity website.
Summary of the invention
In view of this, the present invention proposes a method, a device and a server for mining representative pictures of points of interest, which can obtain clearer representative pictures more accurately from the entity website of a point of interest.
In a first aspect, an embodiment of the present invention provides a method for mining a representative picture of a point of interest, the method including:
obtaining, from the Internet, the entity website corresponding to the point of interest according to the name and address of the point of interest;
searching for at least one representative page of the entity website according to the anchor text on the homepage of the entity website, wherein the representative page includes introductory information about the entity website;
reading the representative page to obtain a representative picture set; and
obtaining at least one picture in the representative picture set that best meets a predetermined feature as the representative picture of the point of interest.
In a second aspect, an embodiment of the present invention provides a device for mining a representative picture of a point of interest, the device including:
an entity website acquisition module, configured to obtain, from the Internet, the entity website corresponding to the point of interest according to the name and address of the point of interest;
a representative page search module, configured to search for at least one representative page of the entity website according to the anchor text on the homepage of the entity website, wherein the representative page includes introductory information about the entity website;
a representative picture set acquisition module, configured to read the representative page and obtain a representative picture set; and
a representative picture acquisition module, configured to obtain at least one picture in the representative picture set that best meets a predetermined feature as the representative picture of the point of interest.
In a third aspect, an embodiment of the present invention provides a server, the server including the device for mining a representative picture of a point of interest according to the second aspect above.
The present invention obtains the entity website of a point of interest from the Internet using the name and address of the point of interest, obtains a representative page from among the many pages of the entity website, obtains a representative picture set from the representative page, and uses a picture classifier to obtain the representative picture from that set. It thereby obtains the representative picture of the point of interest from an entity website on the Internet, widens the scope of the search for representative pictures, improves the accuracy with which representative pictures are obtained, and improves the clarity of the representative pictures obtained.
Brief description of the drawings
Fig. 1 is a flow chart of the method for mining a representative picture of a point of interest provided by the first embodiment of the present invention.
Fig. 2 is a flow chart of entity website acquisition provided by the first embodiment of the present invention.
Fig. 3 is a flow chart of the representative page search provided by the first embodiment of the present invention.
Fig. 4 is a flow chart of representative picture set acquisition provided by the first embodiment of the present invention.
Fig. 5 is a flow chart of the method for mining a representative picture of a point of interest provided by the second embodiment of the present invention.
Fig. 6 is a structural diagram of the device for mining a representative picture of a point of interest provided by the third embodiment of the present invention.
Fig. 7 is a schematic diagram of a server on which embodiments of the present invention can be implemented.
Detailed description of the embodiments
The technical solution of the present invention is further described below with reference to the accompanying drawings and by way of specific embodiments.
Fig. 1 to Fig. 4 show the first embodiment of the present invention.
Fig. 1 is a flow chart of the method for mining a representative picture of a point of interest provided by the first embodiment of the present invention. Referring to Fig. 1, the method includes: step S110, obtaining, from the Internet, the entity website corresponding to the point of interest according to the name and address of the point of interest; step S120, searching for at least one representative page of the entity website according to the anchor text on the homepage of the entity website, wherein the representative page includes introductory information about the entity website; step S130, reading the representative page to obtain a representative picture set; and step S140, obtaining at least one picture in the representative picture set that best meets a predetermined feature as the representative picture of the point of interest.
In step S110, the entity website corresponding to the point of interest is obtained from the Internet according to the name and address of the point of interest.
In everyday life, people may refer to the same point of interest by different names. For example, the "Chinese People's Liberation Army General Hospital" is also known as the "PLA General Hospital" or the "301 Hospital". To avoid failing to obtain the correct entity website because a different name is used, the present invention obtains the entity website corresponding to the point of interest by combining the name and the address of the point of interest.
Fig. 2 is a flow chart of entity website acquisition provided by the first embodiment of the present invention. Referring to Fig. 2, step S110 includes: sub-step S111, segmenting the name of the point of interest into words according to predefined semantic rules and searching for at least one website whose homepage title completely contains the segmentation result, to form a candidate website set; sub-step S112, obtaining the real address information of each candidate website; and sub-step S113, comparing the real address information of each candidate website with the address of the point of interest and taking the candidate website with the highest similarity as the entity website of the point of interest.
In sub-step S111, the name of the point of interest is segmented into words according to predefined semantic rules, and at least one website whose homepage title completely contains the segmentation result is searched for, forming the candidate website set.
Those skilled in the art will appreciate that the name of the point of interest is a character string made up of several Chinese characters. To obtain the entity websites related to the name of the point of interest, the character string corresponding to the name is first segmented into words according to predefined semantic rules, producing a segmentation result.
In a preferred implementation of this embodiment, the word segmentation uses an understanding-based segmentation method, and the semantic rules used for understanding Chinese are predefined.
The segmentation result is an array of at least one character string. For example, segmenting "Beijing Hospital" yields a string array consisting of the two strings "Beijing" and "hospital".
Then, entity websites whose site title completely contains the segmentation result are searched for on the Internet, and the entity websites found constitute the candidate website set of the point of interest.
For example, using the segmentation result for "Beijing Hospital", namely the two strings "Beijing" and "hospital", all entity websites whose homepage title contains both "Beijing" and "hospital" are found on the Internet to form the candidate website set. The search returns two entity websites: "Beijing Hospital", whose homepage uniform resource locator (URL) is "http://www.bjhmoh.cn/", and the "Third Affiliated Hospital of Beijing University of Chinese Medicine", whose homepage URL is "http://www.zydsy.com/cn/index/index.aspx". These two entity websites together make up the candidate website set.
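The candidate filter in sub-step S111 can be sketched as follows. This is a minimal illustration only: it assumes the name has already been segmented, and the `homepage_titles` index, the `find_candidate_sites` function and the `example.com` entry are hypothetical stand-ins for whatever search backend would actually be used. Titles are shown in English translation, so the segment is written "Hospital" to match them.

```python
from typing import Dict, List


def find_candidate_sites(segments: List[str], homepage_titles: Dict[str, str]) -> List[str]:
    """Return the URLs whose homepage title contains every segment of the POI name."""
    return [
        url
        for url, title in homepage_titles.items()
        if all(segment in title for segment in segments)
    ]


# Hypothetical index mapping homepage URL -> homepage title.
homepage_titles = {
    "http://www.bjhmoh.cn/": "Beijing Hospital",
    "http://www.zydsy.com/cn/index/index.aspx": (
        "Third Affiliated Hospital of Beijing University of Chinese Medicine"
    ),
    "http://example.com/": "Shanghai Library",  # made-up non-matching site
}

candidates = find_candidate_sites(["Beijing", "Hospital"], homepage_titles)
```

Requiring every segment to appear in the title keeps the candidate set small, at the cost of still admitting sites such as the second hospital, which is why the address comparison in the later sub-steps is needed.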
In sub-step S112, the real address information of each candidate website is obtained.
The real address information is the actual address on the map of the entity represented by the point of interest. The pages of an entity website generally include the actual map address of the entity it represents, that is, the real address information of the entity website. For example, the real address information of Beijing Hospital is "No. 1 Dahua Road, Dongdan, Dongcheng District, Beijing". In this embodiment, the real address information must be obtained for every candidate website in the candidate website set.
For the candidate websites in the candidate website set, the real address information sometimes appears on the homepage of the candidate website and sometimes on its contact page. The sub-step of obtaining the real address information of each candidate website can therefore be completed by reading the homepage of the candidate website, by reading its contact page, or by a combination of the two.
If the real address information of a candidate website is obtained by reading its homepage, a first keyword is searched for on the homepage of the candidate website, and the character string that follows the first keyword and whose length is less than a first length threshold is taken as the real address information of the candidate website. In one preferred implementation of this embodiment, the first keyword is "address". In another preferred implementation of this embodiment, the first length threshold is 35.
If the real address information of a candidate website is obtained by reading its contact page, the contact page is first located by searching for a second keyword in the anchor text on the homepage of the candidate website. The first keyword is then searched for on the contact page, and the character string that follows the first keyword found there and whose length is less than the first length threshold is taken as the real address information of the candidate website. In one preferred implementation of this embodiment, the first keyword is "address" and the second keyword is "contact us". In another preferred implementation of this embodiment, the first length threshold is 35.
The real address information is obtained for each candidate website in the candidate website set made up of the two entity websites "Beijing Hospital" and "Third Affiliated Hospital of Beijing University of Chinese Medicine". The real address information of the entity website corresponding to "Beijing Hospital" is "No. 1 Dahua Road, Dongdan, Dongcheng District, Beijing", and the real address information of the entity website corresponding to the "Third Affiliated Hospital of Beijing University of Chinese Medicine" is "No. 51 Xiaoguan, Anwai, Chaoyang District, Beijing".
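The keyword-based address extraction of sub-step S112 might look like the sketch below. It assumes the page has already been reduced to plain text and that the address runs from the keyword to the end of the same line; both are assumptions of this sketch, not statements from the text, and `extract_address` is a hypothetical name. The same function could be applied to a contact page once that page has been located through the "contact us" anchor text.

```python
from typing import Optional

ADDRESS_KEYWORD = "address"  # first keyword in the preferred implementation
MAX_ADDRESS_LENGTH = 35      # first length threshold in the preferred implementation


def extract_address(page_text: str) -> Optional[str]:
    """Return the text following the keyword if it is shorter than the threshold."""
    for line in page_text.splitlines():
        position = line.lower().find(ADDRESS_KEYWORD)
        if position == -1:
            continue
        # Assumption: the address is whatever follows the keyword on the same line.
        candidate = line[position + len(ADDRESS_KEYWORD):].strip(" :：\t")
        if 0 < len(candidate) < MAX_ADDRESS_LENGTH:
            return candidate
    return None


homepage_text = "Welcome\nAddress: No. 1 Dahua Road, Dongdan\nTel: (not shown)"
address = extract_address(homepage_text)
```

The length threshold acts as a cheap sanity check: a long run of text after the keyword is more likely to be a paragraph that merely mentions the word than an actual postal address.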
In sub-step S113, the real address information of each candidate website is compared with the address of the point of interest, and the candidate website with the highest similarity is taken as the entity website of the point of interest.
After the real address information of the candidate websites has been obtained, the address of the point of interest is compared with the real address information of each candidate website in order to determine which of the at least one candidate websites is the entity website corresponding to the point of interest.
First, the address of the point of interest is compared directly with the real address information of each candidate website. If the real address information of a candidate website matches the address of the point of interest, that candidate website is determined to be the entity website corresponding to the point of interest.
If the direct comparison fails to determine the entity website corresponding to the point of interest, the address of the point of interest undergoes full-width/half-width character conversion and is then compared with the real address information of each candidate website. If, after the full-width/half-width conversion, the real address information of a candidate website matches the address of the point of interest, that candidate website is determined to be the entity website corresponding to the point of interest.
If neither the direct comparison nor the comparison after full-width/half-width conversion determines the entity website corresponding to the point of interest, the address of the point of interest is segmented into words. The similarity between the point of interest and the real address information of each candidate website is calculated according to how much of the segmentation result of the point-of-interest address is contained in that real address information, and the candidate website whose real address information has the highest similarity is determined to be the entity website corresponding to the point of interest. Specifically, suppose the segmentation result of the point-of-interest address contains P character strings, of which q are contained in the real address information of a candidate website; the similarity between the real address information of that candidate website and the address of the point of interest is then (q ÷ P) × 100%.
In a preferred implementation of this embodiment, the word segmentation uses an understanding-based segmentation method, and the semantic rules used for understanding Chinese are predefined.
After the address information has been compared, the entity website corresponding to "Beijing Hospital" in the candidate website set, that is, the entity website whose URL is "http://www.bjhmoh.cn/", is taken as the entity website of the point of interest.
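The three-stage comparison of sub-step S113 can be sketched as below. Two points are assumptions of this sketch rather than details given in the text: the full-width/half-width conversion is approximated with Unicode NFKC normalization (which also folds some other compatibility characters), and it is applied to both addresses before they are compared. The function names and the sample addresses are illustrative only.

```python
import unicodedata
from typing import Dict, List, Optional


def to_half_width(text: str) -> str:
    """Fold full-width characters to half-width forms (approximated with NFKC)."""
    return unicodedata.normalize("NFKC", text)


def similarity(poi_segments: List[str], site_address: str) -> float:
    """(q / P) * 100: q of the P POI address segments occur in the site address."""
    if not poi_segments:
        return 0.0
    matched = sum(1 for segment in poi_segments if segment in site_address)
    return matched / len(poi_segments) * 100


def pick_entity_site(
    poi_address: str, poi_segments: List[str], site_addresses: Dict[str, str]
) -> Optional[str]:
    """Return the URL of the candidate whose address best matches the POI address."""
    if not site_addresses:
        return None
    # Stage 1: direct comparison.
    for url, address in site_addresses.items():
        if address == poi_address:
            return url
    # Stage 2: comparison after full-width/half-width conversion (both sides folded).
    folded = to_half_width(poi_address)
    for url, address in site_addresses.items():
        if to_half_width(address) == folded:
            return url
    # Stage 3: highest share of POI address segments found in the site address.
    return max(site_addresses, key=lambda url: similarity(poi_segments, site_addresses[url]))


site_addresses = {
    "http://www.bjhmoh.cn/": "No. 1 Dahua Road, Dongdan, Dongcheng District, Beijing",
    "http://www.zydsy.com/cn/index/index.aspx": "No. 51 Xiaoguan, Anwai, Chaoyang District, Beijing",
}
best = pick_entity_site(
    "Dahua Road 1, Dongcheng, Beijing", ["Dahua", "Dongcheng", "Beijing"], site_addresses
)
```

Ordering the stages from cheapest and strictest to most permissive means the segment-based similarity is only computed when no exact match exists.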
In step S120, at least one representative page of the entity website is searched for according to the anchor text on the homepage of the entity website, wherein the representative page includes introductory information about the entity website.
The introduction page of an entity website usually describes the corresponding entity in detail, and pictures showing the general appearance of the entity often appear there. The introduction page of the entity website therefore needs to be obtained as the representative page of the entity.
Anchor text refers to a kind of link on the Internet in which a text keyword serves as the link and points to another web page.
Fig. 3 is a flow chart of the representative page search provided by the first embodiment of the present invention. Referring to Fig. 3, step S120 includes: sub-step S121, taking as representative pages the pages pointed to by those links, among all links on the homepage of the entity website, whose anchor text contains a third keyword. In a preferred implementation of this embodiment, the third keyword includes "introduction", "brief introduction" or "overview".
In sub-step S121, the pages pointed to by those links, among all links on the homepage of the entity website, whose anchor text contains the third keyword are taken as representative pages.
Those skilled in the art will appreciate that an entity website generally has a page presenting basic information about the entity. This page contains introductory information about the entity and mainly serves to present the basic situation of the entity to visitors of the entity website. Such a page is therefore called the introduction page of the entity website. Moreover, the introduction page of an entity website often includes pictures showing the entity, and these generally include pictures showing the general appearance of the entity's buildings. The introduction page therefore needs to be taken as the representative page of the entity website.
The homepage of an entity website generally has links pointing to the introduction page, and these links are usually associated with anchor text containing the third keyword. The third keyword includes "introduction", "brief introduction" or "overview". The representative pages of the entity website are therefore identified by checking whether the anchor text of each link on the entity website contains the third keyword.
In a preferred implementation of this embodiment, step S120 further includes sub-step S122, taking as representative pages the pages pointed to by links on the homepage of the entity website whose anchor text is shorter than a second length threshold. In this preferred implementation, the value of the second length threshold is 5.
In sub-step S122, the pages pointed to by links on the homepage of the entity website whose anchor text is shorter than the second length threshold are taken as representative pages.
Those skilled in the art will appreciate that, on the homepage of an entity website, the anchor text of links pointing to the introduction page is always fairly short. To prevent links from being misidentified as pointing to the introduction page, the length of the anchor text of such links needs to be limited. The length of the anchor text of a link pointing to the introduction page, that is, to a selected representative page, is therefore required to be less than the second length threshold. In a preferred implementation of this embodiment, the value of the second length threshold is 5.
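Sub-steps S121 and S122 can be combined in a short sketch using Python's standard `html.parser` module. Because the threshold of 5 is evidently meant for Chinese anchor text, the keywords below are the assumed Chinese originals of "introduction", "brief introduction" and "overview" (介绍, 简介, 概况); these strings, the sample homepage markup and the names `AnchorCollector` and `find_representative_pages` are assumptions of this sketch.

```python
from html.parser import HTMLParser
from typing import List, Optional, Sequence, Tuple

# Assumed Chinese originals of the third keyword.
INTRO_KEYWORDS = ("介绍", "简介", "概况")
MAX_ANCHOR_LENGTH = 5  # second length threshold in the preferred implementation


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_representative_pages(
    homepage_html: str,
    keywords: Sequence[str] = INTRO_KEYWORDS,
    max_length: int = MAX_ANCHOR_LENGTH,
) -> List[str]:
    """Return hrefs whose anchor text is short and contains one of the keywords."""
    collector = AnchorCollector()
    collector.feed(homepage_html)
    return [
        href
        for href, text in collector.links
        if len(text) < max_length and any(keyword in text for keyword in keywords)
    ]


homepage_html = (
    '<a href="/intro">医院简介</a>'
    '<a href="/news">新闻</a>'
    '<a href="/history">关于本院的详细介绍和历史</a>'
)
pages = find_representative_pages(homepage_html)
```

The third link contains a keyword but is rejected by the length check, which is the misidentification case the second length threshold is meant to guard against.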
In a preferred implementation of this embodiment, step S120 further includes sub-step S123: if at least two representative pages are found, duplicate representative pages are removed from the at least two representative pages.
In sub-step S123, if at least two representative pages are found, duplicate representative pages are removed from the at least two representative pages.
Among the links on the homepage of an entity website, two or more links may point to the same page. To avoid obtaining two or more links from the entity website that point to the same representative page, which would cause repeated processing later on, the representative pages obtained need to be de-duplicated.
If at least two representative pages are obtained, the uniform resource locators (URLs) of the at least two representative pages are compared; if two or more of the representative pages have the same URL, the representative pages corresponding to the repeated URL are removed.
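The de-duplication in sub-step S123 amounts to keeping the first occurrence of each URL. A minimal sketch, assuming URLs are compared as exact strings (the text does not mention any URL normalization) and using the hypothetical name `deduplicate_pages`:

```python
from typing import Iterable, List


def deduplicate_pages(urls: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each URL, preserving the original order."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


unique_pages = deduplicate_pages(["/intro", "/overview", "/intro"])
```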
By searching the anchor text on the homepage of the entity website corresponding to "Beijing Hospital", the page whose URL is "http://www.bjhmoh.cn/templates/T_new_second/index.aspx?nodeid=103" is determined to be the representative page of the entity website.
In step S130, the representative page is read and a representative picture set is obtained.
Fig. 4 is a flow chart of representative picture set acquisition provided by the first embodiment of the present invention. Referring to Fig. 4, step S130 includes: sub-step S131, parsing the representative page to obtain the Document Object Model (DOM) tree of the representative page; sub-step S132, traversing the DOM tree and dividing the representative page into different content blocks according to the content of the representative page; sub-step S133, labelling the different content blocks according to their content features and taking the content block labelled as the web page core content block as the target content block; and sub-step S134, reading the pictures in the target content block whose pixel count is greater than a minimum pixel threshold to obtain the representative picture set.
In sub-step S131, the representative page is parsed to obtain the Document Object Model (DOM) tree of the representative page.
The Document Object Model (DOM) is a cross-platform, programming-language-independent specification provided by the W3C for interacting with HTML, XHTML and XML documents. It provides a powerful application programming interface (API) for reading and writing HTML, XHTML and XML documents. Under the DOM specification, each object in an HTML, XHTML or XML document is treated as a node, and all nodes are arranged in a tree structure. This tree structure is called the DOM tree. In this embodiment, the DOM tree is used to parse the representative page and to obtain the representative picture set from it.
At present, most web pages on the Internet are written in HTML. To obtain the representative picture of the entity website, the representative page is parsed to obtain its DOM tree.
In sub-step S132, the DOM tree is traversed and the representative page is divided into different content blocks according to its content.
The document node is the root node of the DOM tree of an HTML page. The DOM tree of the representative page can be split into several subtrees according to the distribution of the child nodes of the document node, with each subtree corresponding to one display block of the representative page. Such a display block is called a content block of the representative page. In other words, splitting the DOM tree of the representative page into several subtrees divides the representative page into different content blocks.
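Sub-steps S131 and S132 can be illustrated with the standard-library DOM implementation `xml.dom.minidom`. This is a sketch under two assumptions that are not in the text: the page is well-formed XHTML (`minidom` is an XML parser, so real-world HTML would need a more lenient parser), and each element directly under `body` is treated as one content block. The function names and the sample page are illustrative.

```python
from typing import List
from xml.dom import minidom
from xml.dom.minidom import Element


def split_into_blocks(xhtml: str) -> List[Element]:
    """Parse the page into a DOM tree and return one subtree per content block."""
    document = minidom.parseString(xhtml)
    body = document.getElementsByTagName("body")[0]
    # Assumption: every element that is a direct child of <body> is one block.
    return [node for node in body.childNodes if node.nodeType == node.ELEMENT_NODE]


def image_sources(block: Element) -> List[str]:
    """Return the src attribute of every <img> element inside a block."""
    return [img.getAttribute("src") for img in block.getElementsByTagName("img")]


page = (
    "<html><body>"
    '<div id="nav"><a href="/">Home</a></div>'
    '<div id="main"><p>About the hospital</p><img src="building.jpg"/></div>'
    "</body></html>"
)
blocks = split_into_blocks(page)
block_ids = [block.getAttribute("id") for block in blocks]
main_images = image_sources(blocks[1])
```

Working on subtrees rather than on the whole page means that, once a block has been labelled as core content, only the pictures inside that subtree need to be considered.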
In sub-step S133, the different content blocks are labelled according to their content features, and the content block labelled as the web page core content block is taken as the target content block.
In this embodiment, a content block classifier is used to label content blocks that have different content features. The content block classifier is a classifier, trained by machine learning, that classifies the content blocks of a web page. It can label the content blocks of a web page as "navigation bar content block", "web page core content block" or "contact information content block".
After the different content blocks have been labelled, the "web page core content block" is taken as the target content block from which the representative pictures are extracted.
In sub-step S134, the pictures in the target content block whose pixel count is greater than the minimum pixel threshold are read to obtain the representative picture set.
Some of the picture objects in the target content block have too few pixels to be suitable as representative pictures of the entity website. To avoid using such pictures as representative pictures of the entity website, a minimum pixel threshold is set for representative pictures. In a preferred implementation of this embodiment, the minimum pixel threshold is 10,000 pixels.
After the target content block of the representative page has been selected, the picture objects are identified in the DOM tree of the representative page, the pictures corresponding to the identified picture objects are read, and those whose pixel count is greater than the minimum pixel threshold are collected to form the representative picture set.
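The pixel filter of sub-step S134 can be sketched as below. The sketch assumes that the threshold applies to the total pixel count, taken here as width multiplied by height, and that each picture's dimensions are already known; the `Picture` type, the `build_picture_set` function and the sample file names are hypothetical.

```python
from typing import List, NamedTuple

MIN_PIXELS = 10_000  # minimum pixel threshold in the preferred implementation


class Picture(NamedTuple):
    url: str
    width: int
    height: int


def build_picture_set(pictures: List[Picture]) -> List[Picture]:
    """Keep only pictures whose pixel count exceeds the minimum threshold."""
    # Assumption: the pixel count of a picture is width * height.
    return [picture for picture in pictures if picture.width * picture.height > MIN_PIXELS]


pictures = [
    Picture("logo.png", 80, 60),         # 4,800 pixels: too small
    Picture("thumbnail.jpg", 100, 100),  # exactly 10,000 pixels: not greater than the threshold
    Picture("building.jpg", 800, 600),   # 480,000 pixels: kept
]
picture_set = build_picture_set(pictures)
```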
In step S140, obtain at least one pictures conduct meeting predetermined characteristic in described representative picture set the most The representative picture of described point of interest.
The content possibility thousand of the picture in the representative picture set obtaining from the object content block of described representing pages is poor Ten thousand is other.But, due to the location Based service of the present embodiment(LBS)Application background, need from described representative picture set The middle content choosing picture is the picture of building as representative picture.The standard carrying out the selection of described representative picture is described generation The content characteristic of the picture in table picture set.
Described step S140 includes sub-step S141, and the picture classification device being obtained using training in advance is from representative picture set Middle at least one pictures choosing the maximum probability that image content is building are as representative picture.
For the differentiation being made whether building picture to the picture in described representative picture set according to image content, adopt With a picture classification device.Whether described picture classification device using image pattern recognition, build by the content of the picture to input Build and judged, and export the probable value that the content of described input picture is building.Described picture classification device needs to read in advance A large amount of building pictures and non-building picture, and extract the total content characteristic training formation picture classification model of building picture.Institute State after picture classification model formed, input described picture classification device one pictures, described picture classification device judges the picture of input Content be building probability, and export described probability.
In sub-step S141, the picture in the representative picture set of acquisition is inputted to described picture classification device, described According to the picture classification model that training in advance obtains, picture classification device judges that the content of described input picture is the probability built, Choose at least one pictures of the maximum probability that image content is building afterwards as representative picture from representative picture set.
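The selection in sub-step S141 can be sketched as a ranking by classifier output. This is a minimal illustration only: the function name `choose_representative_pictures`, the `top_k` parameter and the score table are assumptions, and the trained picture classifier is replaced by a lookup of made-up probabilities.

```python
from typing import Callable, Iterable, List, Tuple


def choose_representative_pictures(
    pictures: Iterable[str],
    building_probability: Callable[[str], float],
    top_k: int = 1,
) -> List[Tuple[str, float]]:
    """Return the top_k pictures ranked by building probability, highest first."""
    scored = [(pic, building_probability(pic)) for pic in pictures]
    # The sort is stable, so pictures with equal probability keep their input order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical scores standing in for the output of a trained picture classifier.
fake_scores = {"lobby.jpg": 0.35, "facade.jpg": 0.92, "menu.jpg": 0.05}
best = choose_representative_pictures(fake_scores, fake_scores.get, top_k=1)
print(best)  # [('facade.jpg', 0.92)]
```

Passing the classifier as a callable keeps the ranking step independent of how the picture classification model is trained or stored.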
The present embodiment passes through to obtain the entity website of point of interest from internet, searches and represent page from described entity website Face, obtains representative picture set from described representing pages, finally chooses representative picture from described representative picture set, realize Based on point of interest representative picture in location-based service from the automatic acquisition of entity website, whole acquisition process without manual intervention, The accuracy rate that picture obtains is high, and the representative picture obtaining has higher definition.
Fig. 5 shows the second embodiment of the present invention.
Fig. 5 is a schematic flowchart of the mining method of point-of-interest representative pictures provided by the second embodiment of the invention. Referring to Fig. 5, the mining method includes: step S510, obtaining from the Internet the entity website corresponding to the point of interest according to the name and address of the point of interest; step S520, looking up at least one representing page of the entity website according to the anchor text on the homepage of the entity website, wherein the representing page includes introduction information about the entity website; step S530, reading the representing page and obtaining a representative picture set; step S540, obtaining at least one picture in the representative picture set that best meets a predetermined feature as the representative picture of the point of interest; and step S550, performing additional processing on the representative picture.
In this embodiment, steps S510 to S540 are identical to steps S110 to S140 of the first embodiment of the invention and are not described again here.
In step S550, additional processing is performed on the representative picture, wherein the additional processing includes scaling, cropping, watermark removal and edge sharpening.
The representative picture obtained through steps S510 to S540 sometimes cannot directly meet the display requirements of a location-based service (LBS) application, so further additional processing of the representative picture is needed. For example, if the size of the obtained representative picture does not meet the requirements, the representative picture needs to be scaled or cropped. If the obtained representative picture contains a watermark, the watermark needs to be removed. As a further example, if the boundary between the building and the background in the obtained representative picture is not distinct, edge sharpening needs to be applied to the representative picture.
Through entity website acquisition, representing page lookup, representative picture set acquisition, representative picture acquisition and additional processing of the representative picture, this embodiment not only obtains representative pictures of higher definition from the entity website more accurately, but also applies further image processing to them so that the representative pictures obtained from the entity website can be used directly in location-based services.
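The scaling and cropping part of the additional processing can be illustrated with plain arithmetic. The sketch below assumes a centred crop to the display aspect ratio followed by uniform scaling; the function names and the target size are hypothetical, and watermark removal and edge sharpening are not covered.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]


def center_crop_box(width: int, height: int, target_w: int, target_h: int) -> Box:
    """Return (left, top, right, bottom) of the largest centred region
    that has the same aspect ratio as the target size."""
    if min(width, height, target_w, target_h) <= 0:
        raise ValueError("all dimensions must be positive")
    # Compare aspect ratios by cross-multiplication to stay in integers.
    if width * target_h > height * target_w:
        # Source is too wide: keep the full height and trim both sides.
        crop_w = height * target_w // target_h
        left = (width - crop_w) // 2
        return (left, 0, left + crop_w, height)
    # Source is too tall or already matches: keep the full width.
    crop_h = width * target_h // target_w
    top = (height - crop_h) // 2
    return (0, top, width, top + crop_h)


def scale_factor(box: Box, target_w: int) -> float:
    """Uniform scale that maps the cropped region onto the target width."""
    left, _top, right, _bottom = box
    return target_w / (right - left)


box = center_crop_box(1600, 900, 400, 300)
print(box)  # (200, 0, 1400, 900)
```

The resulting box and factor could then be handed to whatever imaging library performs the actual crop and resize.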
Fig. 6 is a schematic structural diagram of the mining device of point-of-interest representative pictures provided by the second embodiment of the invention. The mining device 600 of point-of-interest representative pictures includes: an entity website acquisition module 610, a representing page acquisition module 620, a representative picture set acquisition module 630, a representative picture acquisition module 640, and a representative picture additional processing module 650.
The entity website acquisition module 610 is used to search the Internet for the entity website of the point of interest according to the name and address of the point of interest, and to associate the entity website found with the point of interest. The entity website acquisition module 610 includes a candidate website acquisition submodule 611, a real address information acquisition submodule 612 and a real address information comparison submodule 613.
The candidate website acquisition submodule 611 is used to perform word segmentation on the name of the point of interest according to predefined semantic rules and to search for at least one website whose homepage title completely contains the word segmentation result, forming a candidate website set.
The real address information acquisition submodule 612 is used to search each candidate website for the real address information of the entity corresponding to that candidate website. Because the real address information of the entity may be displayed on the homepage of the candidate website or on its contact page, the real address information acquisition submodule can obtain the real address information either from the homepage of the candidate website or from the contact page of the candidate website.
The real address information comparison submodule 613 is used to compare the real address information of each candidate website with the address of the point of interest and to take the candidate website with the highest similarity as the entity website of the point of interest. To improve the efficiency of recognizing the real address information of the candidate websites, the real address information comparison submodule 613 may compare the address of the point of interest with the real address information of the candidate website directly, after full-width/half-width conversion, or after word segmentation of the address of the point of interest.
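The title check performed when forming the candidate website set can be sketched as follows. The segmentation itself is assumed to have been done already, and the token list, URLs and homepage titles are invented for illustration.

```python
from typing import Dict, List


def candidate_websites(name_tokens: List[str], homepage_titles: Dict[str, str]) -> List[str]:
    """Return the URLs whose homepage title contains every token of the
    segmented point-of-interest name."""
    tokens = [t for t in name_tokens if t]
    if not tokens:
        return []
    return [
        url
        for url, title in homepage_titles.items()
        if all(token in title for token in tokens)
    ]


tokens = ["Grand", "Plaza", "Hotel"]  # hypothetical segmentation result
titles = {
    "http://a.example": "Grand Plaza Hotel - Official Site",
    "http://b.example": "Plaza Hotel Reviews",
    "http://c.example": "Welcome to the Grand Plaza Hotel",
}
candidates = candidate_websites(tokens, titles)
print(candidates)  # ['http://a.example', 'http://c.example']
```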
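A minimal sketch of the homepage-then-contact-page lookup, using the keyword-and-length-threshold rule recited in claim 2. The keyword `"Address:"`, the threshold of 60 characters, and the choice to stop at the end of the line are all assumptions made for the example; the text does not give these values.

```python
from typing import Optional


def extract_address(page_text: str, keyword: str = "Address:", max_len: int = 60) -> Optional[str]:
    """Return the text that follows `keyword` on the same line,
    provided it is non-empty and shorter than max_len."""
    start = page_text.find(keyword)
    if start == -1:
        return None
    rest = page_text[start + len(keyword):]
    candidate = rest.splitlines()[0].strip() if rest else ""
    if candidate and len(candidate) < max_len:
        return candidate
    return None


def address_for_site(homepage_text: str, contact_text: Optional[str] = None) -> Optional[str]:
    """Try the homepage first, then fall back to the contact page."""
    found = extract_address(homepage_text)
    if found is None and contact_text is not None:
        found = extract_address(contact_text)
    return found


home = "Welcome to our hotel"
contact = "Contact\nAddress: 5 Sample Street, Beijing\nTel: 000"
print(address_for_site(home, contact))  # 5 Sample Street, Beijing
```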
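The comparison after full-width/half-width conversion can be sketched with the Python standard library. Unicode NFKC normalization folds full-width forms to their half-width equivalents. The similarity measure is not specified in the text, so `difflib.SequenceMatcher` is used here purely as a stand-in, and the addresses are invented.

```python
import unicodedata
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


def normalize_address(address: str) -> str:
    """Fold full-width characters to half-width (NFKC), drop whitespace, lowercase."""
    folded = unicodedata.normalize("NFKC", address)
    return "".join(folded.split()).lower()


def address_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; SequenceMatcher is an assumed stand-in measure."""
    return SequenceMatcher(None, normalize_address(a), normalize_address(b)).ratio()


def pick_entity_website(poi_address: str, candidates: Dict[str, str]) -> Optional[Tuple[str, float]]:
    """Return (url, similarity) for the candidate whose address is most similar."""
    if not candidates:
        return None
    scored = [(url, address_similarity(poi_address, addr)) for url, addr in candidates.items()]
    return max(scored, key=lambda item: item[1])


sites = {
    "http://a.example": "１０ Ｅｘａｍｐｌｅ Ｒｏａｄ",  # full-width characters
    "http://b.example": "99 Other Street",
}
chosen = pick_entity_website("10 Example Road", sites)
print(chosen)  # ('http://a.example', 1.0)
```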
The representing page acquisition module 620 is used to look up one or more representing pages of the entity website according to the anchor text on the homepage of the entity website. The representing page acquisition module 620 includes a keyword identification submodule 621.
The keyword identification submodule 621 is used to take, among all links on the homepage of the entity website, the pages pointed to by links whose anchor text contains a third keyword as representing pages. In a preferred implementation of this embodiment, the third keyword is "contact us".
In a preferred implementation of this embodiment, the representing page acquisition module 620 also includes an anchor text length limit submodule 622. The anchor text length limit submodule 622 is used to take the pages pointed to by links on the homepage of the entity website whose anchor text is shorter than a second length threshold as representing pages. In a preferred implementation of this embodiment, the value of the second length threshold is 5.
In another preferred implementation of this embodiment, the representing page acquisition module 620 also includes a representing page deduplication submodule 623. The representing page deduplication submodule 623 is used to remove repeated representing pages when at least two representing pages have been found. Whether two or more representing pages are duplicates is judged by comparing their uniform resource locators (URLs).
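The keyword and length checks on anchor text can be sketched with the standard `html.parser` module. Applying both conditions together is an assumption about how the two submodules combine. The threshold of 5 from the text is kept as a parameter, but the example passes a larger hypothetical value because English anchor text is longer; the markup is invented.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_representing_pages(homepage_html: str, keyword: str, max_anchor_len: int = 5) -> List[str]:
    """Return hrefs whose anchor text contains the keyword and is shorter
    than max_anchor_len characters."""
    parser = AnchorCollector()
    parser.feed(homepage_html)
    parser.close()
    return [
        href
        for href, text in parser.links
        if keyword in text and len(text) < max_anchor_len
    ]


homepage = (
    '<a href="/contact.html">Contact us</a>'
    '<a href="/news.html">Contact us for the latest group booking offers</a>'
    '<a href="/rooms.html">Rooms</a>'
)
pages = find_representing_pages(homepage, "Contact us", max_anchor_len=12)
print(pages)  # ['/contact.html']
```

The length limit filters out long promotional links that merely mention the keyword.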
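Deduplication by URL comparison can be sketched as below. The text only says that URLs are compared; resolving relative links against the homepage URL and ignoring fragments are additional assumptions, and the URLs are invented.

```python
from typing import Iterable, List
from urllib.parse import urldefrag, urljoin


def dedupe_representing_pages(base_url: str, hrefs: Iterable[str]) -> List[str]:
    """Resolve each link against base_url, drop the fragment, and keep
    the first occurrence of every resulting URL."""
    seen = set()
    unique: List[str] = []
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute not in seen:
            seen.add(absolute)
            unique.append(absolute)
    return unique


links = [
    "/about.html",
    "about.html",
    "http://hotel.example/about.html#top",
    "/contact.html",
]
unique_pages = dedupe_representing_pages("http://hotel.example/index.html", links)
print(unique_pages)
# ['http://hotel.example/about.html', 'http://hotel.example/contact.html']
```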
The representative picture set acquisition module 630 is used to read the representing page and obtain a representative picture set. The representative picture set acquisition module 630 includes a representing page parsing submodule 631, a content block division submodule 632, an object content block determination submodule 633 and a representative picture set acquisition submodule 634.
The representing page parsing submodule 631 is used to parse the representing page and obtain the Document Object Model (DOM) tree of the representing page. The Document Object Model (DOM) is a programming specification published by the W3C for interacting with HTML, XHTML and XML files. Using the DOM, the various objects in HTML, XHTML and XML files can be manipulated. The representing page parsing submodule 631 parses the representing page into a DOM tree, which gives the other modules a usable data structure for obtaining representative pictures from the representing page.
The content block division submodule 632 is used to traverse the DOM tree and divide the representing page into different content blocks according to the content of the representing page. The content block division submodule 632 divides the representing page into different content blocks by dividing the DOM tree of the representing page into different subtrees.
The object content block determination submodule 633 is used to label the different content blocks according to their content features, and to take the content block labelled as the core content block of the web page as the object content block.
A content block classifier is used when labelling the different content blocks according to their content features. The content block classifier is a classifier, trained with machine learning methods, that labels web page content blocks. After the content block classifier has labelled the different content blocks, the content block labelled as the core content block of the web page is taken as the object content block.
The representative picture set acquisition submodule 634 is used to read the pictures in the object content block whose pixel value is greater than the minimum pixel threshold, obtaining the representative picture set.
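Parsing a page into a tree and splitting that tree into subtrees can be sketched with the standard `html.parser` module. The simplified `Node` class stands in for a real DOM implementation. The rule of treating the outermost elements from `BLOCK_TAGS` as content blocks is an assumption, since the text does not say how the content-based division is decided.

```python
from html.parser import HTMLParser
from typing import List, Optional

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
BLOCK_TAGS = {"div", "section", "article", "table", "ul", "ol"}  # assumed rule


class Node:
    """Simplified element node; not a full DOM implementation."""

    def __init__(self, tag: str, attrs: dict, parent: Optional["Node"] = None) -> None:
        self.tag = tag
        self.attrs = attrs
        self.parent = parent
        self.children: List["Node"] = []
        self.text = ""


class TreeBuilder(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.root = Node("#document", {})
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs), self._current)
        self._current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never get children
            self._current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray end tags.
        node = self._current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self._current = node.parent

    def handle_data(self, data):
        self._current.text += data


def parse_dom(html: str) -> Node:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def content_blocks(node: Node) -> List[Node]:
    """Return the outermost block-level subtrees in document order."""
    blocks: List[Node] = []
    for child in node.children:
        if child.tag in BLOCK_TAGS:
            blocks.append(child)
        else:
            blocks.extend(content_blocks(child))
    return blocks


page = ('<html><body><div id="nav"><a href="/">Home</a></div>'
        '<div id="main"><p>About the hotel</p><img src="facade.jpg"></div>'
        '</body></html>')
blocks = content_blocks(parse_dom(page))
print([b.attrs.get("id") for b in blocks])  # ['nav', 'main']
```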
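The threshold filter can be sketched as follows, under a loudly stated assumption: "pixel value" is read here as the pixel count `width * height` declared on each `img` tag. The text does not define the measure, and a real system might inspect the downloaded image instead. The markup and the threshold of 40000 are invented.

```python
from html.parser import HTMLParser
from typing import List


class ImageCollector(HTMLParser):
    """Collect src values of <img> tags whose declared pixel count
    (width * height, an assumed measure) exceeds min_pixels."""

    def __init__(self, min_pixels: int) -> None:
        super().__init__()
        self.min_pixels = min_pixels
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        try:
            pixels = int(attr.get("width") or 0) * int(attr.get("height") or 0)
        except ValueError:
            return  # non-numeric size such as "100%"
        if attr.get("src") and pixels > self.min_pixels:
            self.sources.append(attr["src"])


def representative_picture_set(block_html: str, min_pixels: int) -> List[str]:
    collector = ImageCollector(min_pixels)
    collector.feed(block_html)
    collector.close()
    return collector.sources


block = (
    '<div>'
    '<img src="facade.jpg" width="800" height="600">'
    '<img src="icon.png" width="16" height="16">'
    '<img src="banner.jpg" width="100%" height="80">'
    '</div>'
)
picture_set = representative_picture_set(block, min_pixels=40000)
print(picture_set)  # ['facade.jpg']
```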
The representative picture acquisition module 640 is used to obtain at least one picture in the representative picture set that best meets a predetermined feature as the representative picture of the point of interest.
The representative picture acquisition module 640 includes a representative picture acquisition submodule 641. The representative picture acquisition submodule 641 is used to choose from the representative picture set, using a picture classifier obtained by training in advance, at least one picture with the highest probability that its content is a building as the representative picture.
The representative picture additional processing module 650 is used to perform additional processing on the representative picture, wherein the additional processing includes scaling, cropping, watermark removal and edge sharpening.
This embodiment uses the entity website acquisition module, the representing page lookup module, the representative picture set acquisition module and the representative picture acquisition module to obtain the representative picture of a point of interest from its entity website. Compared with the prior art, which obtains representative pictures from vertical websites, the pictures obtained by this embodiment are more accurate and of higher definition, and the whole acquisition process requires no manual intervention and is fully automatic.
Fig. 7 is a schematic diagram of a server on which embodiments of the invention can be implemented. The server is a data processing system. Fig. 7 illustrates multiple components of the server, but it is not intended to represent any particular architecture or manner of interconnecting the components. It should also be appreciated that other data processing systems having fewer or more components may also be used with the invention.
As shown in Fig. 7, server A0 is one form of data processing system, which may take the form of various terminals such as a personal computer, notebook computer, tablet computer, digital media player or smart mobile communication terminal. Server A0 may include bus A1. Microprocessor A2, volatile memory A3 and non-volatile memory A4 are all connected to bus A1; in some cases the server may also include hard disk storage A5. These components exchange data and communicate over bus A1. Microprocessor A2 may be a single independent microprocessor or a set of one or more microprocessors. Bus A1 connects the above components together and also connects them to display controller A6 and a display device and to input/output (I/O) devices A7. The input/output (I/O) devices A7 include at least a voice acquisition device for inputting voice and a display device for display; they may also be a mouse, keyboard, modem, network interface, touch input device, motion-sensing input device, printer and other devices well known in the art. Typically, the input/output devices A7 are connected to the system through I/O controller A8.
Volatile memory A3 in server A0, also called main memory, is characterized by fast data reads and writes. Specifically, volatile memory A3 can be implemented as dynamic random access memory (DRAM), which requires continuous power to refresh or maintain the data in the memory.
Typically, non-volatile memory A4 refers to memory whose stored data does not disappear after power is turned off; it may include, for example, read-only memory (ROM) and flash memory.
Bus A1 may include one or more buses connected to each other through various bridges, controllers and/or adapters well known in the art. In one embodiment, I/O controller A8 includes a USB (Universal Serial Bus) adapter for controlling USB peripherals, an IEEE 1394 controller for IEEE 1394 peripherals, or a Bluetooth controller for controlling Bluetooth peripherals, as well as peripheral controllers suitable for other peripheral interface standards.
Obviously, those skilled in the art will understand that the modules or steps of the invention described above can be implemented by the communication terminal described above; the functions of sending and receiving voice information can be integrated in the same communication terminal, so that the terminal can both send and receive voice information. Alternatively, embodiments of the invention can be implemented as a program executable by a computer device, so that they can be stored in a storage device and executed by a processor. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc or the like. Alternatively, the modules or steps may each be made into an individual integrated circuit module, or several of them may be made into a single integrated circuit module. The invention is therefore not limited to any specific combination of hardware and software.
The foregoing is only embodiments of the invention and does not thereby limit the patent scope of the invention. Any equivalent structure or equivalent process transformation made using the content of the specification and drawings of the invention, or any direct or indirect use in other related technical fields, is likewise included within the scope of patent protection of the invention.

Claims (17)

1. A mining method of point-of-interest representative pictures, characterized by comprising:
obtaining from the Internet the entity website corresponding to a point of interest according to the name and address of the point of interest;
looking up at least one representing page of the entity website according to the anchor text on the homepage of the entity website, wherein the representing page includes introduction information about the entity website;
reading the representing page and obtaining a representative picture set; and
obtaining at least one picture in the representative picture set that best meets a predetermined feature as the representative picture of the point of interest;
wherein obtaining from the Internet the entity website corresponding to the point of interest according to the name and address of the point of interest includes:
performing word segmentation on the name of the point of interest according to predefined semantic rules, and searching for at least one website whose homepage title completely contains the word segmentation result, to form a candidate website set;
obtaining the real address information of each candidate website, wherein the real address information is the actual address, on a map, of the entity represented by the candidate website; and
comparing the real address information of each candidate website with the address of the point of interest, and taking the candidate website with the highest similarity as the entity website of the point of interest.
2. The mining method of point-of-interest representative pictures according to claim 1, characterized in that obtaining the real address information of each candidate website includes:
searching the homepage of the candidate website for a first keyword, and taking the character string that follows the first keyword and is shorter than a first length threshold as the real address information of the candidate website; or
finding the contact page of the candidate website by searching the anchor text on the homepage of the candidate website for a second keyword, then searching the contact page for the first keyword, and taking the character string that follows the first keyword found on the contact page and is shorter than the first length threshold as the real address information of the candidate website.
3. The mining method of point-of-interest representative pictures according to claim 1, characterized in that looking up at least one representing page of the entity website according to the anchor text on the homepage of the entity website includes:
taking the pages pointed to by links on the homepage of the entity website whose anchor text contains a third keyword as representing pages.
4. The mining method of point-of-interest representative pictures according to claim 3, characterized in that looking up at least one representing page of the entity website according to the anchor text on the homepage of the entity website further includes:
taking the pages pointed to by links on the homepage of the entity website whose anchor text is shorter than a second length threshold as representing pages.
5. The mining method of point-of-interest representative pictures according to claim 3, characterized in that looking up at least one representing page of the entity website according to the anchor text on the homepage of the entity website further includes:
if at least two representing pages are found, removing the repeated representing pages from the at least two representing pages.
6. The mining method of point-of-interest representative pictures according to claim 1, characterized in that reading the representing page and obtaining a representative picture set includes:
parsing the representing page to obtain the Document Object Model (DOM) tree of the representing page;
traversing the Document Object Model (DOM) tree and dividing the representing page into different content blocks according to the content of the representing page;
labelling the different content blocks according to their content features, and taking the content block labelled as the core content block of the web page as the object content block; and
reading the pictures in the object content block whose pixel value is greater than a minimum pixel threshold, to obtain the representative picture set.
7. The mining method of point-of-interest representative pictures according to claim 1, characterized in that obtaining at least one picture in the representative picture set that best meets a predetermined feature as the representative picture of the point of interest includes:
using a picture classifier obtained by training in advance to choose from the representative picture set at least one picture with the highest probability that its content is a building as the representative picture.
8. The mining method of point-of-interest representative pictures according to claim 1, characterized in that, after the representative picture of the point of interest is obtained through picture filtering, the method further includes:
performing additional processing on the representative picture, wherein the additional processing includes scaling, cropping, watermark removal and edge sharpening.
9. A mining device of point-of-interest representative pictures, characterized by comprising:
an entity website acquisition module, for obtaining from the Internet the entity website corresponding to a point of interest according to the name and address of the point of interest;
a representing page lookup module, for looking up at least one representing page of the entity website according to the anchor text on the homepage of the entity website, wherein the representing page includes introduction information about the entity website;
a representative picture set acquisition module, for reading the representing page and obtaining a representative picture set; and
a representative picture acquisition module, for obtaining at least one picture in the representative picture set that best meets a predetermined feature as the representative picture of the point of interest;
wherein the entity website acquisition module includes:
a candidate website set generation submodule, for performing word segmentation on the name of the point of interest according to predefined semantic rules, and searching for at least one website whose homepage title completely contains the word segmentation result, to form a candidate website set;
a real address information reading submodule, for obtaining the real address information of each candidate website, wherein the real address information is the actual address, on a map, of the entity represented by the candidate website; and
a real address information comparison submodule, for comparing the real address information of each candidate website with the address of the point of interest, and taking the candidate website with the highest similarity as the entity website of the point of interest.
10. The mining device of point-of-interest representative pictures according to claim 9, characterized in that the operation by which the real address information reading submodule obtains the real address information of each candidate website includes:
searching the homepage of the candidate website for a first keyword, and taking the character string that follows the first keyword and is shorter than a first length threshold as the real address information of the candidate website; or
finding the contact page of the candidate website by searching the anchor text on the homepage of the candidate website for a second keyword, then searching the contact page for the first keyword, and taking the character string that follows the first keyword found on the contact page and is shorter than the first length threshold as the real address information of the candidate website.
11. The mining device of point-of-interest representative pictures according to claim 9, characterized in that the representing page lookup module includes:
a keyword identification submodule, for taking, among all links on the homepage of the entity website, the pages pointed to by links whose anchor text contains a third keyword as representing pages.
12. The mining device of point-of-interest representative pictures according to claim 11, characterized in that the representing page lookup module further includes:
an anchor text length limit submodule, for taking the pages pointed to by links on the homepage of the entity website whose anchor text is shorter than a second length threshold as representing pages.
13. The mining device of point-of-interest representative pictures according to claim 11, characterized in that the representing page lookup module further includes:
a representing page deduplication submodule, for removing the repeated representing pages from the at least two representing pages when at least two representing pages are found.
14. The mining device of point-of-interest representative pictures according to claim 9, characterized in that the representative picture set acquisition module includes:
a representing page parsing submodule, for parsing the representing page to obtain the Document Object Model (DOM) tree of the representing page;
a content block division submodule, for traversing the Document Object Model (DOM) tree and dividing the representing page into different content blocks according to the content of the representing page;
an object content block determination submodule, for labelling the different content blocks according to their content features, and taking the content block labelled as the core content block of the web page as the object content block; and
a representative picture set acquisition submodule, for reading the pictures in the object content block whose pixel value is greater than a minimum pixel threshold, to obtain the representative picture set.
15. The mining device of point-of-interest representative pictures according to claim 9, characterized in that the representative picture acquisition module includes:
a representative picture acquisition submodule, for using a picture classifier obtained by training in advance to choose from the representative picture set at least one picture with the highest probability that its content is a building as the representative picture.
16. The mining device of point-of-interest representative pictures according to claim 9, characterized in that the mining device of point-of-interest representative pictures further includes:
a representative picture additional processing module, for performing additional processing on the representative picture, wherein the additional processing includes scaling, cropping, watermark removal and edge sharpening.
17. A server, characterized in that the server includes the mining device of point-of-interest representative pictures according to any one of claims 9 to 16.
CN201310306642.XA 2013-07-19 2013-07-19 Mining method and device of POI (point of interest) representing images and server Active CN103399885B (en)

Publications (2)

Publication Number Publication Date
CN103399885A CN103399885A (en) 2013-11-20
CN103399885B true CN103399885B (en) 2017-02-08




Also Published As

Publication number Publication date
CN103399885A (en) 2013-11-20

Similar Documents

Publication Publication Date Title
CN103399885B (en) Mining method and device of POI (point of interest) representing images and server
Chen et al. Function-based object model towards website adaptation
CN104461484B (en) Implementation method and device for front-end templates
CN102460432B (en) Selective content extraction
CN104598577B (en) Method for extracting web page text
CA2918840C (en) Presenting fixed format documents in reflowed format
US9369418B2 (en) Determining additional information associated with geographic location information
CN102270206A (en) Method and device for capturing valid web page contents
WO2015047920A1 (en) Title and body extraction from web page
JP2020187733A (en) Application programming interface documentation annotation
CN108804469B (en) Webpage identification method and electronic equipment
CN103544178A (en) Method and equipment for providing reconstruction page corresponding to target page
CN105068989A (en) Place name and address extraction method and apparatus
CN109492177B (en) Web page blocking method based on web page semantic structure
CN111310693A (en) Intelligent labeling method and device for text in image and storage medium
CN102651002A (en) Webpage information extracting method and system
CN113515928B (en) Electronic text generation method, device, equipment and medium
CN109033282A (en) Web page text extraction method and device based on extraction template
EP2423837B1 (en) Method and system for viewing web page and computer program product thereof
CN103942211A (en) Text page recognition method and device
CN103544150A (en) Method and system for providing recommendation information for mobile terminal browser
KR101391107B1 (en) Method and apparatus for providing search service presenting class of search target interactively
US10198408B1 (en) System and method for converting and importing web site content
CN105589918A (en) Method and device for extracting page information
CN106897287B (en) Webpage release time extraction method and device for webpage release time extraction

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant