CN103514221B - Web site resource management method and device

Publication number
CN103514221B
Authority
CN
China
Prior art keywords
web site
page
index
resource
index page
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210222539.2A
Other languages
Chinese (zh)
Other versions
CN103514221A (en)
Inventor
刘承诚
薛晶晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention provides a web site resource management method and device. The method includes the steps of: checking the number of web sites; if the number of web sites is one, further checking whether the web site has an index page; if there is an index page, optimizing the index page to generate a first index page; if there is no index page, generating a second index page according to the structure of the web site; and if the number of web sites is two or more, establishing a cross-site index page based on semantics. According to the web site resource management method of the embodiments of the present invention, index pages are built by different methods for the three cases distinguished by whether the web site has an index page and by the number of web sites, so that the resource organization relationships of the site can be fully mined, the degree of aggregation of site resources is improved, and the display effect of site pages is improved.

Description

Web site resource management method and device
Technical field
The present invention relates to the field of analyzing and mining web site resource organization relationships, and in particular to a web site resource management method and device.
Background
Nowadays, technology for converting web sites into apps is increasingly common. Converting a web site into an app requires the resource organization relationships of the site, so the organization of resources within the web site needs to be analyzed and mined to obtain structured resource organization relationship data.
At present, the resource organization relationships of a web site are mined mainly by manual inspection, and there is no mature prior art, so the following disadvantages exist:
(1) the mining of a site's resource organization relationships does not apply different methods to different categories, so the mining is incomplete and the degree of aggregation is low;
(2) there is no fixed mining method, so the resulting resource organization relationships are unclear and rather chaotic, and cannot easily be structured.
Summary of the invention
The present invention aims to solve at least one of the above technical problems.
To this end, a first object of the present invention is to provide a web site resource management method.
A second object of the present invention is to provide a web site resource management device.
To achieve the above objects, a web site resource management method according to an embodiment of the first aspect of the present invention includes the following steps: checking the number of the web sites; if the number of the web sites is one, further checking whether the web site has an index page; if there is an index page, optimizing the index page to generate a first index page; if there is no index page, generating a second index page according to the structure of the web site; and if the number of the web sites is two or more, establishing a cross-site index page based on semantics.
According to the web site resource management method of the embodiments of the present invention, index pages are built by different methods for the three cases distinguished by whether the web site has an index page and by the number of web sites, so that the resource organization relationships of the site can be fully mined, the degree of aggregation of site resources is improved, and the display effect of site pages is improved.
To achieve the above objects, a web site resource management device according to an embodiment of the second aspect of the present invention includes: a first checking module for checking the number of the web sites; a second checking module for checking, when the number of the web sites is one, whether the web site has an index page; an optimization module for optimizing the index page to generate a first index page when the web site has an index page; a generation module for generating a second index page according to the structure of the web site when the web site has no index page; and an establishing module for establishing a cross-site index page based on semantics when the number of the web sites is two or more.
According to the web site resource management device of the embodiments of the present invention, index pages are built by different methods for the three cases distinguished by whether the web site has an index page and by the number of web sites, so that the resource organization relationships of the site can be fully mined, the degree of aggregation of site resources is improved, and the display effect of site pages is improved.
Additional aspects and advantages of the present invention will be given in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flow chart of a web site resource management method according to one embodiment of the present invention;
Fig. 2 is a flow chart of a web site resource management method according to one embodiment of the present invention;
Fig. 3 is a flow chart of a web site resource management method according to one embodiment of the present invention;
Fig. 4 is a flow chart of a web site resource management method according to one embodiment of the present invention;
Fig. 5 is a structural diagram of a web site resource management device according to one embodiment of the present invention;
Fig. 6 is a structural diagram of a web site resource management device according to one embodiment of the present invention;
Fig. 7 is a structural diagram of a web site resource management device according to one embodiment of the present invention; and
Fig. 8 is a structural diagram of a web site resource management device according to one embodiment of the present invention.
Detailed description
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present invention, and are not to be construed as limiting the claims.
These and other aspects of the embodiments of the present invention will become clear with reference to the following description and drawings. In the description and drawings, some particular implementations of the embodiments are specifically disclosed to show some ways of implementing the principles of the embodiments, but it should be understood that the scope of the embodiments is not limited thereby. On the contrary, the embodiments of the present invention include all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.
A web site resource management method according to embodiments of the present invention is described below with reference to the accompanying drawings.
A web site resource management method includes the following steps: checking the number of web sites; if the number of web sites is one, further checking whether the web site has an index page; if there is an index page, optimizing the index page to generate a first index page; if there is no index page, generating a second index page according to the structure of the web site; and if the number of web sites is two or more, establishing a cross-site index page based on semantics.
Fig. 1 is a flow chart of the web site resource management method of one embodiment of the present invention.
As shown in Fig. 1, the web site resource management method according to an embodiment of the present invention includes the following steps:
Step S101: check the number of web sites.
Specifically, the number of web sites whose structured resource organization relationships need to be obtained is detected.
Step S102: if the number of web sites is one, further check whether the web site has an index page.
Specifically, if the check finds that there is one web site whose structured resource organization relationships are to be obtained, the pages of the web site are mined to check whether there is an index page containing the index information of the web site.
Step S103: if there is an index page, optimize the index page to generate a first index page.
Specifically, if the web site has an index page, the page information in the index page is obtained and optimized to obtain the information needed to build the first index page, and the first index page of the web site is generated.
Step S104: if there is no index page, generate a second index page according to the structure of the web site.
Specifically, if the web site has no index page, the resource pages of the web site are mined, the structure of the web site is obtained from the information in the resource pages, and the second index page of the web site is generated according to the structure information of the web site.
Step S105: if the number of web sites is two or more, establish a cross-site index page based on semantics.
Specifically, if the check finds that there are two or more web sites whose structured resource organization relationships are to be obtained, these web sites are linked according to the semantic relevance of their resource pages, resource index organization information is obtained across the sites, and an index page is generated from the obtained resource index organization information.
According to the web site resource management method of the embodiments of the present invention, index pages are built by different methods for the three cases distinguished by whether the web site has an index page and by the number of web sites, so that the resource organization relationships of the site can be fully mined, the degree of aggregation of site resources is improved, and the display effect of site pages is improved.
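The branching in steps S101 to S105 can be sketched as a small decision function. This is only an illustrative sketch in Python, not the patented implementation: the `Site` class, its `index_page_url` field, the `Strategy` enum, and `choose_strategy` are hypothetical names introduced here, and the sketch assumes that whether a site has an index page has already been determined by crawling.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Strategy(Enum):
    """The three ways of building an index page (names are hypothetical)."""
    OPTIMIZE_EXISTING_INDEX = "first index page"   # step S103
    BUILD_FROM_STRUCTURE = "second index page"     # step S104
    BUILD_CROSS_SITE = "cross-site index page"     # step S105


@dataclass
class Site:
    """Hypothetical stand-in for a crawled web site."""
    name: str
    index_page_url: Optional[str] = None  # None means no index page was found


def choose_strategy(sites: Sequence[Site]) -> Strategy:
    """Pick the index-building strategy for the given sites."""
    if not sites:
        raise ValueError("at least one web site is required")
    if len(sites) >= 2:                      # S101 found two or more sites
        return Strategy.BUILD_CROSS_SITE
    site = sites[0]                          # S102: exactly one site
    if site.index_page_url is not None:      # the site has an index page
        return Strategy.OPTIMIZE_EXISTING_INDEX
    return Strategy.BUILD_FROM_STRUCTURE     # the site has no index page
```

Keeping the decision separate from the three index builders means each builder can be developed and tested on its own.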
Fig. 2 is a flow chart of the web site resource management method of another embodiment of the present invention.
As shown in Fig. 2, the web site resource management method according to an embodiment of the present invention includes the following steps.
Step S201: check the number of web sites.
Specifically, the number of web sites whose structured resource organization relationships need to be obtained is detected.
Step S202: if the number of web sites is one, further check whether the web site has an index page.
Specifically, if the check finds that there is one web site whose structured resource organization relationships are to be obtained, the pages of the web site are mined to check whether there is an index page containing the index information of the web site.
Step S203: if there is an index page, delete the non-index information in the index page.
Specifically, if the web site has an index page, all the information of the index page is obtained and analyzed, and the non-index information in it is deleted.
Step S204: delete the index entries in the index page that cannot link to a resource page in the web site.
Specifically, the index entries remaining in the index page information are checked to see whether each can link to the resource page in the web site to which it points, and the index entries that cannot link to the resource page in the web site to which they point are deleted.
Step S205: extract the valid index entries in the index page to generate the first index page.
Specifically, the valid index information remaining in the index page is extracted and consolidated on one page to generate the first index page.
Step S206: if there is no index page, generate a second index page according to the structure of the web site.
Specifically, if the web site has no index page, the resource pages of the web site are mined, the structure of the web site is obtained from the information in the resource pages, and the second index page of the web site is generated according to the structure information of the web site.
Step S207: if the number of web sites is two or more, establish a cross-site index page based on semantics.
Specifically, if the check finds that there are two or more web sites whose structured resource organization relationships are to be obtained, these web sites are linked according to the semantic relevance of their resource pages, resource index organization information is obtained across the sites, and an index page is generated from the obtained resource index organization information.
In one embodiment of the present invention, the non-index information includes advertisements and animations.
According to the web site resource management method of the embodiments of the present invention, the non-index information and invalid index entries of the index page are deleted and the index page is generated from the valid index entries, so that the resource organization relationships of the web site can be obtained effectively, with clear relationships and a higher degree of aggregation.
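Steps S203 to S205 amount to a filter pipeline over the items found on an existing index page. The Python sketch below is one possible reading under stated assumptions: the `Block` model, the `kind` labels, and the `is_reachable` callback are hypothetical, and "can link to a resource page in the web site" is interpreted here as "the URL is on the site's own host and resolves". The patent does not specify how either check is performed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List
from urllib.parse import urlparse


@dataclass(frozen=True)
class Block:
    """One item found on an index page (hypothetical model)."""
    kind: str       # e.g. "link", "advertisement", "animation"
    text: str
    url: str = ""


# The embodiment names advertisements and animations as non-index information.
NON_INDEX_KINDS = {"advertisement", "animation"}


def optimize_index_page(
    blocks: Iterable[Block],
    site_host: str,
    is_reachable: Callable[[str], bool],
) -> List[Block]:
    """Return the valid index entries that make up the first index page."""
    # S203: delete non-index information.
    entries = [b for b in blocks if b.kind not in NON_INDEX_KINDS]
    # S204: delete entries that do not lead to a resource page in this site
    # (assumed here to mean: same host and reachable).
    valid = [
        b for b in entries
        if urlparse(b.url).hostname == site_host and is_reachable(b.url)
    ]
    # S205: the remaining entries form the first index page.
    return valid
```

Passing `is_reachable` in as a callback keeps the sketch free of network access; a real crawler would supply its own link checker.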
Fig. 3 is a flow chart of the web site resource management method of another embodiment of the present invention.
As shown in Fig. 3, the web site resource management method according to an embodiment of the present invention includes the following steps.
Step S301: check the number of web sites.
Specifically, the number of web sites whose structured resource organization relationships need to be obtained is detected.
Step S302: if the number of web sites is one, further check whether the web site has an index page.
Specifically, if the check finds that there is one web site whose structured resource organization relationships are to be obtained, the pages of the web site are mined to check whether there is an index page containing the index information of the web site.
Step S303: if there is an index page, delete the non-index information in the index page.
Specifically, if the web site has an index page, all the information of the index page is obtained and analyzed, and the non-index information in it is deleted.
Step S304: delete the index entries in the index page that cannot link to a resource page in the web site.
Specifically, the index entries remaining in the index page information are checked to see whether each can link to the resource page in the web site to which it points, and the index entries that cannot link to the resource page in the web site to which they point are deleted.
Step S305: extract the valid index entries in the index page to generate the first index page.
Specifically, the valid index information remaining in the index page is extracted and consolidated on one page to generate the first index page.
Step S306: if there is no index page, judge whether the resource pages in the web site have titles.
Specifically, if the web site has no index page, the resource pages of the web site are mined starting from the home page, the information of the resource pages is obtained, and it is judged whether each resource page has title information.
Step S307: if so, extract the title of the resource page as an index entry.
Specifically, if a resource page has title information, the title in the resource page is extracted as an index entry.
Step S308: if not, generate summary information of the resource page as an index entry.
Specifically, if a resource page has no title information, summary information is generated from the main information contained in the resource page and used as an index entry.
Step S309: generate the second index page from the index entries.
Specifically, the index entries obtained for all resource pages are consolidated on one page to generate the second index page.
Step S310: if the number of web sites is two or more, establish a cross-site index page based on semantics.
Specifically, if the check finds that there are two or more web sites whose structured resource organization relationships are to be obtained, these web sites are linked according to the semantic relevance of their resource pages, resource index organization information is obtained across the sites, and an index page is generated from the obtained resource index organization information.
In one embodiment of the present invention, the non-index information includes advertisements and animations.
According to the web site resource management method of the embodiments of the present invention, index entries are generated from the title information or summary information of the resource pages, and the index page is then generated from these index entries, which improves the degree of aggregation of the resource organization relationships and the clarity of the relationships.
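The title-or-summary fallback of steps S306 to S309 can be illustrated as follows. This is a hedged sketch: `ResourcePage`, `summarize`, `index_entry`, and `build_second_index_page` are hypothetical names, and the summary here is simply the first few words of the page body, standing in for whatever summarization method an implementation would actually use (the patent does not specify one).

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class ResourcePage:
    """Hypothetical model of a mined resource page."""
    url: str
    body: str
    title: Optional[str] = None


def summarize(body: str, max_words: int = 8) -> str:
    """Naive placeholder summary: the first few words of the body."""
    words = body.split()
    summary = " ".join(words[:max_words])
    return summary + "..." if len(words) > max_words else summary


def index_entry(page: ResourcePage) -> str:
    # S306/S307: use the title when the page has one.
    if page.title and page.title.strip():
        return page.title.strip()
    # S308: otherwise fall back to a generated summary.
    return summarize(page.body)


def build_second_index_page(pages: Iterable[ResourcePage]) -> List[dict]:
    # S309: gather one entry per resource page into a single index.
    return [{"entry": index_entry(p), "url": p.url} for p in pages]
```

Treating a whitespace-only title as missing is a design choice of this sketch, so that such pages still receive a usable index entry.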
Fig. 4 is a flow chart of the web site resource management method of another embodiment of the present invention.
As shown in Fig. 4, the web site resource management method according to an embodiment of the present invention includes the following steps.
Step S401: check the number of web sites.
Specifically, the number of web sites whose structured resource organization relationships need to be obtained is detected.
Step S402: if the number of web sites is one, further check whether the web site has an index page.
Specifically, if the check finds that there is one web site whose structured resource organization relationships are to be obtained, the pages of the web site are mined to check whether there is an index page containing the index information of the web site.
Step S403: if there is an index page, delete the non-index information in the index page.
Specifically, if the web site has an index page, all the information of the index page is obtained and analyzed, and the non-index information in it is deleted.
Step S404: delete the index entries in the index page that cannot link to a resource page in the web site.
Specifically, the index entries remaining in the index page information are checked to see whether each can link to the resource page in the web site to which it points, and the index entries that cannot link to the resource page in the web site to which they point are deleted.
Step S405: extract the valid index entries in the index page to generate the first index page.
Specifically, the valid index information remaining in the index page is extracted and consolidated on one page to generate the first index page.
Step S406: if there is no index page, judge whether the resource pages in the web site have titles.
Specifically, if the web site has no index page, the resource pages of the web site are mined starting from the home page, the information of the resource pages is obtained, and it is judged whether each resource page has title information.
Step S407: if so, extract the title of the resource page as an index entry.
Specifically, if a resource page has title information, the title in the resource page is extracted as an index entry.
Step S408: if not, generate summary information of the resource page as an index entry.
Specifically, if a resource page has no title information, summary information is generated from the main information contained in the resource page and used as an index entry.
Step S409: generate the second index page from the index entries.
Specifically, the index entries obtained for all resource pages are consolidated on one page to generate the second index page.
Step S410: if the number of web sites is two or more, predefine multiple templates corresponding to different semantics.
Specifically, if the check finds that there are two or more web sites whose structured resource organization relationships are to be obtained, templates corresponding to the respective semantics are preset according to semantic relevance.
Step S411: classify the resource pages in the two or more web sites according to semantic relevance and organize them into a first web site.
Specifically, the information contained in the resource pages of each web site is obtained, the resource pages of each web site are classified according to the semantic relevance of that information, and the semantically related information in the resource pages is organized into the first web site.
Step S412: find the first template corresponding to the first web site according to the semantic relevance of the first web site.
Specifically, the predefined semantic templates are searched according to the semantic relevance of the resource pages organized in the first web site, and the first template corresponding to the semantics of the first web site is obtained.
Step S413: fill the resource pages in the first web site, classified according to semantic relevance, into the different sub-columns of the first template.
Specifically, according to the attributes of the keywords, the information of each resource page in the first web site is filled into the first template according to semantic relevance, with the corresponding resource page information added to each column.
Step S414: establish the cross-site index page according to the information in the different templates.
Specifically, relevant information is filled into the templates of the different semantics, and the keywords and semantics of each template are consolidated as index entries to establish the cross-site index page.
The implementation of steps S410 to S414 is illustrated below by example.
For example, a semantic template for book information is defined, which includes sub-columns such as basic book information, popular reviews, merchant price comparison, electronic resources, and other modules. Suppose the resource pages of each site include the book Big Talk Design Patterns; the information about Big Talk Design Patterns in each resource page is then consolidated into the first web site as one class. The book information template is then found according to the keywords of the information about this book, and the relevant information in the first web site is filled into each sub-column of the template according to the template's sub-column modules. The key information in the template is then extracted as index entries. By building multiple pages and extracting index entries in this way, the index entries can be consolidated to establish the cross-site index page.
In one embodiment of the present invention, the different semantics include novel titles, news headlines, video titles, commodity names, and the like.
In one embodiment of the present invention, the non-index information includes advertisements and animations.
According to the web site resource management method of the embodiments of the present invention, the information of multiple sites is classified, organized, and filled into templates to generate index entries, which improves the degree of aggregation of the resource organization relationships and the clarity of the relationships.
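The template-filling flow of steps S410 to S414 can be sketched in Python using the book example. Everything named here is an assumption for illustration: the `ResourcePage` fields, the `TEMPLATES` table, and `build_cross_site_index` are hypothetical, and the sketch assumes each page has already been labeled with its semantic category, subject, and target sub-column. The semantic classification itself, which the patent leaves unspecified, is not implemented.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class ResourcePage:
    """Hypothetical, pre-classified resource page from one of several sites."""
    site: str
    subject: str    # what the page is about, e.g. a book title
    semantic: str   # semantic category, e.g. "book"
    column: str     # sub-column the content belongs to
    content: str


# S410: predefined templates, one per semantic category (illustrative only).
TEMPLATES: Dict[str, List[str]] = {
    "book": ["basic information", "popular reviews", "price comparison",
             "electronic resources"],
}


def build_cross_site_index(
    pages: Iterable[ResourcePage],
) -> Dict[Tuple[str, str], Dict[str, list]]:
    # S411: group pages from all sites by (semantic category, subject).
    groups = defaultdict(list)
    for page in pages:
        groups[(page.semantic, page.subject)].append(page)

    index: Dict[Tuple[str, str], Dict[str, list]] = {}
    for (semantic, subject), members in groups.items():
        columns = TEMPLATES.get(semantic)       # S412: look up the template
        if columns is None:
            continue                            # no template for this semantic
        filled: Dict[str, list] = {name: [] for name in columns}
        for page in members:                    # S413: fill the sub-columns
            if page.column in filled:
                filled[page.column].append((page.site, page.content))
        index[(semantic, subject)] = filled     # S414: one entry per subject
    return index
```

Keying the result by category and subject keeps a book and, say, a video with the same name from overwriting each other.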
A web site resource management device according to embodiments of the present invention is described below with reference to the accompanying drawings.
A web site resource management device includes: a first checking module for checking the number of web sites; a second checking module for checking, when the number of web sites is one, whether the web site has an index page; an optimization module for optimizing the index page to generate a first index page when the web site has an index page; a generation module for generating a second index page according to the structure of the web site when the web site has no index page; and an establishing module for establishing a cross-site index page based on semantics when the number of web sites is two or more.
Fig. 5 is a structural diagram of the web site resource management device of one embodiment of the present invention.
As shown in Fig. 5, the web site resource management device according to an embodiment of the present invention includes a first checking module 110, a second checking module 120, an optimization module 130, a generation module 140, and an establishing module 150.
Specifically, the first checking module 110 is configured to check the number of web sites; the second checking module 120 is configured to check, when the number of web sites is one, whether the web site has an index page; the optimization module 130 is configured to optimize the index page to generate a first index page when the web site has an index page; the generation module 140 is configured to generate a second index page according to the structure of the web site when the web site has no index page; and the establishing module 150 is configured to establish a cross-site index page based on semantics when the number of web sites is two or more.
More specifically, the first checking module 110 is configured to detect the number of web sites whose structured resource organization relationships need to be obtained. The second checking module 120 is configured to, when the first checking module 110 finds that there is one such web site, mine the pages of the web site and check whether there is an index page containing the index information of the web site. The optimization module 130 is configured to, when the second checking module 120 finds that the web site has an index page, obtain the page information in the index page, optimize the page information to obtain the information needed to build the first index page, and generate the first index page of the web site. The generation module 140 is configured to, when the second checking module 120 finds that the web site has no index page, mine the resource pages of the web site, obtain the structure of the web site from the information in the resource pages, and generate the second index page of the web site according to the structure information of the web site. The establishing module 150 is configured to, when the first checking module 110 finds that there are two or more such web sites, link these web sites according to the semantic relevance of their resource pages, obtain resource index organization information across the sites, and generate an index page from the obtained resource index organization information.
According to the web site resource management device of the embodiments of the present invention, the two checking modules examine the web sites, and three different modules build index pages by different methods for the three cases distinguished by whether the web site has an index page and by the number of web sites, so that the resource organization relationships of the site can be fully mined, the degree of aggregation of site resources is improved, and the display effect of site pages is improved.
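One way to picture how modules 110 to 150 fit together is a container that holds the five modules as interchangeable callables. The Python sketch below is an assumption-laden illustration, not the device as claimed: `ResourceManagementDevice`, its field names, and the `Site` and `IndexPage` type aliases are hypothetical, and sites and index pages are represented as plain strings for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Site = str        # hypothetical: a site is identified by its name
IndexPage = str   # hypothetical: an index page is represented as text


@dataclass
class ResourceManagementDevice:
    """Wires the five modules together; each module is a plain callable."""
    count_sites: Callable[[List[Site]], int]                 # first checking module 110
    find_index_page: Callable[[Site], Optional[IndexPage]]   # second checking module 120
    optimize: Callable[[IndexPage], IndexPage]               # optimization module 130
    generate: Callable[[Site], IndexPage]                    # generation module 140
    establish: Callable[[List[Site]], IndexPage]             # establishing module 150

    def run(self, sites: List[Site]) -> IndexPage:
        count = self.count_sites(sites)
        if count >= 2:
            return self.establish(sites)      # cross-site index page
        if count != 1:
            raise ValueError("at least one web site is required")
        existing = self.find_index_page(sites[0])
        if existing is not None:
            return self.optimize(existing)    # first index page
        return self.generate(sites[0])        # second index page
```

Because each module is injected, sub-units such as a deletion unit or an extraction unit could be composed inside the callable passed as `optimize` without changing the device itself.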
Fig. 6 is the structural representation of the web site resource managing device of another embodiment of the present invention.
As shown in Figure 6, web site resource managing device according to embodiments of the present invention, including: first checks module 110, Second checks module 120, optimizes module 130, generation module 140 and set up module 150, wherein optimizes module 130 and includes deleting Except unit 131 and the first extracting unit 132.
Specifically, first check that module 110 is for checking the number of web site;Second checks that module 120 is at web In the case of the number of website is one, check whether web site has index page;Optimize module 130 for having index in web site In the case of Ye, it is optimized index page to generate the first index page;Generation module 140 is not for indexing in web site In the case of Ye, according to structural generation second index page of web site;And set up module 150 for the number in web site In the case of two or more, set up cross-site index page based on semanteme.Wherein delete unit 131 for deleting in index page Non-index information and index page not can connect to the index entry of resource page in web site;First extracting unit 132 is used for Effective index entry in extraction index page is to generate the first index page.
More specifically, the first checking module 110 detects the number of web sites whose structured resource organization relationships are to be obtained. When the first checking module 110 finds exactly one such web site, the second checking module 120 starts mining the pages of that web site and checks whether there is an index page containing the index information of the web site. When the second checking module 120 finds that the web site has an index page, the optimization module 130 obtains the page information in the index page, optimizes it, obtains the information needed to build the first index page, and generates the first index page of the web site. When the second checking module 120 finds that the web site has no index page, the generation module 140 mines the resource pages of the web site, derives the structure of the web site from the information in the resource pages, and generates the second index page of the web site from that structural information. When the first checking module 110 finds two or more such web sites, the establishing module 150 links the web sites according to the semantic relevance of their resource pages, obtains cross-site resource index organization information, and generates an index page from the information obtained.
When the optimization module 130 optimizes the page information and generates the first index page of the web site, the deleting unit 131 obtains all of the information in the index page, analyzes it, and deletes the non-index information. It also checks the index entries that remain to see whether each one can connect to the resource page it points to within the web site, and deletes any entry that cannot. The first extracting unit 132 then extracts the remaining valid index information from the index page and merges it onto a single page, generating the first index page.
In one embodiment of the present invention, the non-index information includes advertisements and animations.
In the web site resource managing device according to this embodiment of the present invention, the deleting unit removes the non-index information and the invalid index entries from the index page, and the first extracting unit generates the index page from the valid index entries. The resource organization relationships of the web site can thus be obtained effectively, with clear relationships and a higher degree of aggregation.
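The optimization performed by units 131 and 132 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Entry` type, its `kind` label, and the `reachable_urls` set (a stand-in for an actual link check against the web site) are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple
from urllib.parse import urlparse


@dataclass(frozen=True)
class Entry:
    """One item found on an existing index page (hypothetical model)."""
    text: str  # visible label of the item
    url: str   # target the item points to
    kind: str  # assumed pre-assigned label: "index", "advertisement" or "animation"


# Non-index information named in the embodiment: advertisements and animations.
NON_INDEX_KINDS = {"advertisement", "animation"}


def optimize_index_page(
    entries: Iterable[Entry], site_host: str, reachable_urls: Set[str]
) -> List[Tuple[str, str]]:
    """Return the valid index entries that make up the first index page."""
    # Deleting unit 131, part 1: drop non-index information.
    index_entries = [e for e in entries if e.kind not in NON_INDEX_KINDS]
    # Deleting unit 131, part 2: drop entries that cannot connect to a
    # resource page inside the web site. Reachability is assumed to be known
    # in advance here; a real system would have to test each link.
    valid = [
        e for e in index_entries
        if urlparse(e.url).netloc == site_host and e.url in reachable_urls
    ]
    # First extracting unit 132: merge the remaining entries onto one page.
    return [(e.text, e.url) for e in valid]
```

Passing reachability in as data keeps the sketch deterministic; how advertisements and animations are recognized in practice is not specified in the text above and is left out.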
Fig. 7 is a structural schematic diagram of a web site resource managing device according to another embodiment of the present invention.
As shown in Fig. 7, the web site resource managing device according to this embodiment of the present invention includes a first checking module 110, a second checking module 120, an optimization module 130, a generation module 140 and an establishing module 150, where the optimization module 130 includes a deleting unit 131 and a first extracting unit 132, and the generation module 140 includes a judging unit 141, a second extracting unit 142 and a generating unit 143.
Specifically, the first checking module 110 checks the number of web sites. The second checking module 120 checks, when the number of web sites is one, whether the web site has an index page. The optimization module 130 optimizes the index page to generate a first index page when the web site has an index page. The generation module 140 generates a second index page according to the structure of the web site when the web site has no index page. The establishing module 150 establishes a cross-site index page based on semantics when the number of web sites is two or more. Within the optimization module 130, the deleting unit 131 deletes the non-index information in the index page, together with any index entries in the index page that cannot connect to a resource page in the web site, and the first extracting unit 132 extracts the valid index entries in the index page to generate the first index page. Within the generation module 140, the judging unit 141 judges whether a resource page in the web site has a title; the second extracting unit 142 extracts the title of the resource page as an index entry when the resource page has a title; and the generating unit 143 generates summary information for the resource page as an index entry when the resource page has no title, and generates the second index page from the index entries.
More specifically, the first checking module 110 detects the number of web sites whose structured resource organization relationships are to be obtained. When the first checking module 110 finds exactly one such web site, the second checking module 120 starts mining the pages of that web site and checks whether there is an index page containing the index information of the web site. When the second checking module 120 finds that the web site has an index page, the optimization module 130 obtains the page information in the index page, optimizes it, obtains the information needed to build the first index page, and generates the first index page of the web site. When the second checking module 120 finds that the web site has no index page, the generation module 140 mines the resource pages of the web site, derives the structure of the web site from the information in the resource pages, and generates the second index page of the web site from that structural information. When the first checking module 110 finds two or more such web sites, the establishing module 150 links the web sites according to the semantic relevance of their resource pages, obtains cross-site resource index organization information, and generates an index page from the information obtained.
When the optimization module 130 optimizes the page information and generates the first index page of the web site, the deleting unit 131 obtains all of the information in the index page, analyzes it, and deletes the non-index information. It also checks the index entries that remain to see whether each one can connect to the resource page it points to within the web site, and deletes any entry that cannot. The first extracting unit 132 then extracts the remaining valid index information from the index page and merges it onto a single page, generating the first index page.
When the generation module 140 generates the second index page according to the structure of a web site that has no index page, the judging unit 141 starts mining the resource pages of the web site from the homepage, obtains the information of those resource pages, and judges whether each resource page has title information. If a resource page has title information, the second extracting unit 142 extracts the title of the resource page as its index entry. If a resource page has no title information, the generating unit 143 generates summary information from the main information contained in the resource page and uses that summary as the index entry. The index entries obtained for all resource pages are merged onto a single page, generating the second index page.
In one embodiment of the present invention, the non-index information includes advertisements and animations.
In the web site resource managing device according to this embodiment of the present invention, the generation module generates index entries from the title information or summary information of the resource pages and then generates the index page from those index entries, which improves both the degree of aggregation of the resource organization relationships and the clarity of those relationships.
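The title-or-summary rule applied by units 141 to 143 can be illustrated with a short sketch. It assumes the web site is already available as an in-memory mapping from URL to page, and it uses a simple leading-words truncation as the summary; both are simplifications introduced here, since the text does not say how pages are fetched or how summaries are produced. The names `ResourcePage`, `summarize` and `build_second_index_page` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ResourcePage:
    """A crawled page of the web site (hypothetical in-memory model)."""
    url: str
    title: Optional[str]
    body: str
    links: List[str] = field(default_factory=list)


def summarize(body: str, max_words: int = 8) -> str:
    """Placeholder summary: the first few words of the page body."""
    words = body.split()
    summary = " ".join(words[:max_words])
    return summary + ("..." if len(words) > max_words else "")


def build_second_index_page(
    site: Dict[str, ResourcePage], homepage: str
) -> List[Tuple[str, str]]:
    """Walk the site breadth-first from the homepage and collect index entries."""
    index: List[Tuple[str, str]] = []
    seen = {homepage}
    queue = deque([homepage])
    while queue:
        url = queue.popleft()
        page = site.get(url)
        if page is None:
            continue  # link to a page that is not part of this site
        # Judging unit 141: does the page have a title?
        title = (page.title or "").strip()
        # Second extracting unit 142 (title) or generating unit 143 (summary).
        label = title if title else summarize(page.body)
        index.append((label, url))
        for link in page.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index
```

In this sketch the homepage itself is treated as a resource page and receives an entry; whether it should be listed is a design choice the text leaves open.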
Fig. 7 is a structural schematic diagram of a web site resource managing device according to another embodiment of the present invention.
As shown in Fig. 7, the web site resource managing device according to this embodiment of the present invention includes a first checking module 110, a second checking module 120, an optimization module 130, a generation module 140 and an establishing module 150. The optimization module 130 includes a deleting unit 131 and a first extracting unit 132; the generation module 140 includes a judging unit 141, a second extracting unit 142 and a generating unit 143; and the establishing module 150 includes a defining unit 151, a classifying unit 152, a retrieval unit 153, a filling unit 154 and an establishing unit 155.
Specifically, the first checking module 110 checks the number of web sites. The second checking module 120 checks, when the number of web sites is one, whether the web site has an index page. The optimization module 130 optimizes the index page to generate a first index page when the web site has an index page. The generation module 140 generates a second index page according to the structure of the web site when the web site has no index page. The establishing module 150 establishes a cross-site index page based on semantics when the number of web sites is two or more. Within the optimization module 130, the deleting unit 131 deletes the non-index information in the index page, together with any index entries in the index page that cannot connect to a resource page in the web site, and the first extracting unit 132 extracts the valid index entries in the index page to generate the first index page. Within the generation module 140, the judging unit 141 judges whether a resource page in the web site has a title; the second extracting unit 142 extracts the title of the resource page as an index entry when the resource page has a title; and the generating unit 143 generates summary information for the resource page as an index entry when the resource page has no title, and generates the second index page from the index entries. Within the establishing module 150, the defining unit 151 predefines a plurality of templates corresponding to different semantics; the classifying unit 152 classifies the resource pages in the two or more web sites according to semantic relevance and organizes them into a first web site; the retrieval unit 153 finds a corresponding template according to the semantic relevance of the first web site; the filling unit 154 fills the resource pages classified by semantic relevance in the first web site into the different columns of the corresponding template; and the establishing unit 155 establishes the cross-site index page according to the information in the different templates.
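The way modules 110 to 150 hand work to one another can be summarized in a small dispatch sketch. The function name and its callable parameters are hypothetical placeholders for the modules, and raising an error for an empty list of web sites is an assumption added here, since the text does not cover that case.

```python
from typing import Callable, Optional, Sequence, Tuple


def manage_site_resources(
    sites: Sequence[str],
    find_index_page: Callable[[str], Optional[str]],
    optimize: Callable[[str], object],
    generate: Callable[[str], object],
    build_cross_site: Callable[[Sequence[str]], object],
) -> Tuple[str, object]:
    """Route the work according to the number of web sites (modules 110 to 150)."""
    # First checking module 110: count the web sites.
    if len(sites) == 0:
        raise ValueError("at least one web site is required")  # assumption
    if len(sites) == 1:
        site = sites[0]
        # Second checking module 120: look for an existing index page.
        index_page = find_index_page(site)
        if index_page is not None:
            # Optimization module 130: clean up the existing index page.
            return ("first index page", optimize(index_page))
        # Generation module 140: build an index from the site structure.
        return ("second index page", generate(site))
    # Establishing module 150: two or more sites, build a semantic cross-site index.
    return ("cross-site index page", build_cross_site(sites))
```

Each branch returns a label together with the result so that a caller can tell which kind of index page was produced.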
More specifically, the first checking module 110 detects the number of web sites whose structured resource organization relationships are to be obtained. When the first checking module 110 finds exactly one such web site, the second checking module 120 starts mining the pages of that web site and checks whether there is an index page containing the index information of the web site. When the second checking module 120 finds that the web site has an index page, the optimization module 130 obtains the page information in the index page, optimizes it, obtains the information needed to build the first index page, and generates the first index page of the web site. When the second checking module 120 finds that the web site has no index page, the generation module 140 mines the resource pages of the web site, derives the structure of the web site from the information in the resource pages, and generates the second index page of the web site from that structural information. When the first checking module 110 finds two or more such web sites, the establishing module 150 links the web sites according to the semantic relevance of their resource pages, obtains cross-site resource index organization information, and generates an index page from the information obtained.
When the optimization module 130 optimizes the page information and generates the first index page of the web site, the deleting unit 131 obtains all of the information in the index page, analyzes it, and deletes the non-index information. It also checks the index entries that remain to see whether each one can connect to the resource page it points to within the web site, and deletes any entry that cannot. The first extracting unit 132 then extracts the remaining valid index information from the index page and merges it onto a single page, generating the first index page.
When the generation module 140 generates the second index page according to the structure of a web site that has no index page, the judging unit 141 starts mining the resource pages of the web site from the homepage, obtains the information of those resource pages, and judges whether each resource page has title information. If a resource page has title information, the second extracting unit 142 extracts the title of the resource page as its index entry. If a resource page has no title information, the generating unit 143 generates summary information from the main information contained in the resource page and uses that summary as the index entry. The index entries obtained for all resource pages are merged onto a single page, generating the second index page.
When the establishing module 150 establishes the cross-site index page based on semantics, the defining unit 151 predefines a plurality of templates corresponding to different semantics. The classifying unit 152 then obtains the information contained in the resource pages of each web site, classifies the resource pages of each web site according to the semantic relevance of that information, and organizes the semantically related information from the resource pages into a first web site. The retrieval unit 153 then searches the predefined semantic templates according to the semantic relevance of the resource pages organized in the first web site, and obtains a first template corresponding to the semantics of the first web site. The filling unit 154 then fills the information of each resource page in the first web site into the first template according to keyword attributes and semantic relevance, adding the corresponding resource page information to each column. Finally, the establishing unit 155 takes the templates for the different semantics, now filled with the relevant information, and integrates the keywords and semantics of each template as index entries, thereby establishing the cross-site index page.
A specific implementation of the establishing module 150 is illustrated below by example.
For example, a semantic template for book information is defined, containing sub-columns such as basic book information, popular reviews, merchant price comparison, electronic resources and other modules. Suppose that the resource pages of each site include the book "Big Talk Design Patterns". The information about "Big Talk Design Patterns" in each resource page is then merged into the first web site as one class. Next, the book information template is found using the keywords of the information about this book, and the relevant information in the first web site is filled into each sub-column of the template according to the template's sub-column modules. The key information in the template is then extracted as index entries. Index entries are extracted from multiple pages built in this way, and these index entries can be integrated to establish the cross-site index page.
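The book example above can be sketched in code. This is only an illustration under explicit assumptions: each resource is taken to arrive already labeled with its semantics, subject and sub-column (a real system would have to infer these from page content), and the `Resource` type, the `TEMPLATES` table, the example URLs and the `build_cross_site_index` function are hypothetical names introduced here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Resource:
    """A resource page from one of several web sites (hypothetical model)."""
    site: str      # web site the page comes from
    subject: str   # what the page is about, e.g. a book title
    semantic: str  # assumed pre-assigned semantic class, e.g. "book"
    column: str    # sub-column of the template the page belongs to
    url: str


# Defining unit 151: predefined templates, one per semantic class.
TEMPLATES: Dict[str, List[str]] = {
    "book": [
        "basic information",
        "popular reviews",
        "merchant price comparison",
        "electronic resources",
    ],
}


def build_cross_site_index(
    resources: Iterable[Resource],
) -> Dict[str, Dict[str, List[str]]]:
    """Group related pages across sites and fill them into semantic templates."""
    # Classifying unit 152: pages about the same subject form one class
    # (the "first web site" of the description).
    groups: Dict[Tuple[str, str], List[Resource]] = defaultdict(list)
    for res in resources:
        groups[(res.semantic, res.subject)].append(res)

    index: Dict[str, Dict[str, List[str]]] = {}
    for (semantic, subject), members in groups.items():
        # Retrieval unit 153: find the template matching the semantics.
        columns = TEMPLATES.get(semantic)
        if columns is None:
            continue  # no predefined template for this semantic class
        # Filling unit 154: place each page in its sub-column.
        filled: Dict[str, List[str]] = {column: [] for column in columns}
        for res in members:
            if res.column in filled:
                filled[res.column].append(res.url)
        # Establishing unit 155: the subject keyword becomes the index entry.
        index[subject] = filled
    return index
```

Keying the result by subject mirrors the idea that the keywords of each filled template serve as index entries; unfilled sub-columns are kept as empty lists so that the template layout stays visible.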
In one embodiment of the present invention, the different semantics include novel titles, news titles, video titles, commodity names and the like.
In one embodiment of the present invention, the non-index information includes advertisements and animations.
In the web site resource managing device according to this embodiment of the present invention, the information of multiple web sites is classified, organized and filled into templates in order to generate index entries, which improves both the degree of aggregation of the resource organization relationships and the clarity of those relationships.
In the description of this specification, a description that refers to the terms "one embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principles and spirit of the present invention. The scope of the present invention is defined by the claims and their equivalents.

Claims (12)

1. A web site resource management method, characterized by comprising the following steps:
Checking the number of web sites;
If the number of web sites is one, further checking whether the web site has an index page;
If there is an index page, optimizing the index page to generate a first index page;
If there is no index page, generating a second index page according to the structure of the web site; and
If the number of web sites is two or more, establishing a cross-site index page based on semantics.
2. The web site resource management method according to claim 1, characterized in that the step of optimizing the index page to generate the first index page comprises:
Deleting the non-index information in the index page;
Deleting the index entries in the index page that cannot connect to a resource page in the web site; and
Extracting the valid index entries in the index page to generate the first index page.
3. The web site resource management method according to claim 1 or 2, characterized in that the step of generating the second index page according to the structure of the web site comprises:
Judging whether a resource page in the web site has a title;
If so, extracting the title of the resource page as an index entry;
If not, generating summary information of the resource page as an index entry; and
Generating the second index page according to the index entries.
4. The web site resource management method according to claim 1 or 2, characterized in that the step of establishing the cross-site index page based on semantics comprises:
Predefining a plurality of templates corresponding to different semantics;
Classifying the resource pages in the two or more web sites according to semantic relevance and organizing them into a first web site;
Finding, according to the semantic relevance of the first web site, a first template corresponding to the first web site;
Filling the resource pages classified by semantic relevance in the first web site into the different sub-columns of the first template, respectively; and
Establishing the cross-site index page according to the information in the different templates.
5. The web site resource management method according to claim 4, characterized in that the different semantics include novel titles, news titles, video titles and commodity names.
6. The web site resource management method according to claim 2, characterized in that the non-index information includes advertisements and animations.
7. A web site resource managing device, characterized by comprising:
A first checking module, the first checking module being configured to check the number of web sites;
A second checking module, the second checking module being configured to check, when the number of web sites is one, whether the web site has an index page;
An optimization module, the optimization module being configured to optimize the index page to generate a first index page when the web site has an index page;
A generation module, the generation module being configured to generate a second index page according to the structure of the web site when the web site has no index page; and
An establishing module, the establishing module being configured to establish a cross-site index page based on semantics when the number of web sites is two or more.
8. The web site resource managing device according to claim 7, characterized in that the optimization module comprises:
A deleting unit, the deleting unit being configured to delete the non-index information in the index page and the index entries in the index page that cannot connect to a resource page in the web site; and
A first extracting unit, the first extracting unit being configured to extract the valid index entries in the index page to generate the first index page.
9. The web site resource managing device according to claim 7 or 8, characterized in that the generation module comprises:
A judging unit, the judging unit being configured to judge whether a resource page in the web site has a title;
A second extracting unit, the second extracting unit being configured to extract the title of the resource page as an index entry when the resource page in the web site has a title; and
A generating unit, the generating unit being configured to generate summary information of the resource page as an index entry when the resource page in the web site has no title, and to generate the second index page according to the index entries.
10. The web site resource managing device according to claim 7 or 8, characterized in that the establishing module comprises:
A defining unit, the defining unit being configured to predefine a plurality of templates corresponding to different semantics;
A classifying unit, the classifying unit being configured to classify the resource pages in the two or more web sites according to semantic relevance and organize them into a first web site;
A retrieval unit, the retrieval unit being configured to find a corresponding template according to the semantic relevance of the first web site;
A filling unit, the filling unit being configured to fill the resource pages classified by semantic relevance in the first web site into the different columns of the corresponding template, respectively; and
An establishing unit, the establishing unit being configured to establish the cross-site index page according to the information in the different templates.
11. The web site resource managing device according to claim 10, characterized in that the different semantics include novel titles, news titles and video titles.
12. The web site resource managing device according to claim 8, characterized in that the non-index information includes advertisements and animations.
CN201210222539.2A 2012-06-28 2012-06-28 A kind of web site resource management method and device Active CN103514221B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210222539.2A CN103514221B (en) 2012-06-28 2012-06-28 A kind of web site resource management method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210222539.2A CN103514221B (en) 2012-06-28 2012-06-28 A kind of web site resource management method and device

Publications (2)

Publication Number Publication Date
CN103514221A CN103514221A (en) 2014-01-15
CN103514221B true CN103514221B (en) 2016-12-28

Family

ID=49896954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210222539.2A Active CN103514221B (en) 2012-06-28 2012-06-28 A kind of web site resource management method and device

Country Status (1)

Country Link
CN (1) CN103514221B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1732459A (en) * 2002-11-01 2006-02-08 Lg电子株式会社 Web content transcoding system and method for small display device
CN101097578A (en) * 2007-06-07 2008-01-02 北京金山软件有限公司 Network resource searching method and system
CN101887422A (en) * 2009-05-13 2010-11-17 北京博越世纪科技有限公司 Technique for keeping synchronous update of data of web site and wap site

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AUPQ680300A0 (en) * 2000-04-10 2000-05-11 Alexsi Pty Ltd A method
US20070143283A1 (en) * 2005-12-09 2007-06-21 Stephan Spencer Method of optimizing search engine rankings through a proxy website
US20080275877A1 (en) * 2007-05-04 2008-11-06 International Business Machines Corporation Method and system for variable keyword processing based on content dates on a web page

Also Published As

Publication number Publication date
CN103514221A (en) 2014-01-15

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant