CN103324622A - Method and device for automatic generating of front page abstract - Google Patents

Publication number
CN103324622A
Authority
CN
China
Prior art keywords
homepage
website
descriptor
pending
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012100754141A
Other languages
Chinese (zh)
Inventor
方高林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN2012100754141A
Publication of CN103324622A

Abstract

The invention provides a method and a device for automatically generating a homepage summary. The method comprises the following steps: A. determining a plurality of homepages that have description information and belong to websites of the same category as the homepage to be processed; B. extracting a summary template from the description information of the determined homepages; C. extracting keywords from the homepage to be processed and filling them into the corresponding slots of the summary template to obtain the summary of the homepage to be processed. The method improves the accuracy with which a summary describes the theme of a homepage.

Description

Method and device for automatically generating a homepage summary
[Technical Field]
The present invention relates to natural language processing technology, and in particular to a method and device for automatically generating a homepage summary.
[Background Art]
When a search engine returns search results to a user, it usually provides, below each result link, summary information about the page the link points to, so that the user can quickly grasp the main content of that page. Please refer to Fig. 1, which is a schematic diagram of the summary information provided for web pages in the search results of a search engine. To provide summary information as shown in Fig. 1, the search engine first needs to extract a summary from the page. For an ordinary web page whose source file provides no description information in a meta tag, the search engine can extract keywords by semantic analysis of the page content and use them as the summary of the page; for a web page whose source file does provide description information in a meta tag, the search engine can use that description information as the summary of the page. Please refer to Fig. 2, which is a schematic diagram of description information contained in a meta tag of a web page source file.
The homepage of a website is the default page shown when the website is opened. A homepage usually serves as a navigation page, so the information it contains is rather cluttered, and semantic analysis of a homepage rarely yields an accurate theme. As a result, when the summary extraction approach described above is applied to a homepage whose source file lacks description information in its meta tag, the resulting summary is cluttered, which reduces its accuracy. Please refer to Fig. 3, which shows a summary extracted with the prior art from a homepage lacking description information. Compared with the description information shown in Fig. 2, the summary in Fig. 3 lacks coherent semantic logic and describes the theme of the homepage poorly.
[Summary of the Invention]
The technical problem to be solved by the present invention is to provide a method and device for automatically generating a homepage summary, so as to overcome the defect of the prior art that automatically generated summaries are inaccurate when the meta tag of a homepage lacks description information.
To solve this technical problem, the present invention provides a method for automatically generating a homepage summary, comprising: A. determining a plurality of homepages that have description information and belong to websites of the same category as the homepage to be processed; B. extracting a summary template from the description information of the determined homepages; C. extracting keywords from the homepage to be processed and filling them into the corresponding slots of the summary template to obtain the summary of the homepage to be processed.
According to a preferred embodiment of the present invention, the method further comprises, before step A: judging whether the homepage to be processed has description information; if so, directly using that description information as the summary of the homepage to be processed; otherwise, performing step A.
According to a preferred embodiment of the present invention, step A specifically comprises: A1. determining, according to a preset website classification table, candidate websites that belong to the same category as the homepage to be processed; A2. obtaining, from the homepages of the candidate websites, a plurality of homepages that have description information.
According to a preferred embodiment of the present invention, the website classification table is obtained by extracting classified websites from navigation classification information on the Internet; or it is obtained by classifying the websites of the clicked pages recorded in a search log, where the classification strategy treats the websites of the different pages clicked for the same query as one category.
According to a preferred embodiment of the present invention, determining the homepages of the candidate websites in step A2 specifically comprises: querying a preset mapping table between websites and homepages to obtain the homepage of each candidate website; or, for each candidate website, using the name of the candidate website as a query to obtain the search results returned by a search engine, and extracting from those results the page that satisfies the homepage features as the homepage of that candidate website.
According to a preferred embodiment of the present invention, the homepage features specifically comprise: the URL of the page contains only a domain name, and the page contains verification information corresponding to the name of the candidate website, the verification information comprising text or an image.
According to a preferred embodiment of the present invention, step B specifically comprises: comparing the description information of the plurality of homepages, and abstracting the parts that occupy the same position in the description information of the plurality of homepages but differ in content into template slots, to obtain the summary template.
According to a preferred embodiment of the present invention, in step C anchor texts are extracted from the homepage to be processed as the keywords.
According to a preferred embodiment of the present invention, the slots of the summary template comprise a website name slot and a navigation theme slot; step C specifically comprises: extracting the website name of the homepage to be processed and filling it into the website name slot of the summary template, and extracting the anchor texts of the homepage to be processed that have navigation features and filling them into the navigation theme slot of the summary template.
The present invention also provides a device for automatically generating a homepage summary, comprising: a homepage determining unit, configured to determine a plurality of homepages that have description information and belong to websites of the same category as the homepage to be processed; a template generation unit, configured to extract a summary template from the description information of the determined homepages; and a keyword extracting unit, configured to extract keywords from the homepage to be processed and fill them into the corresponding slots of the summary template to obtain the summary of the homepage to be processed.
According to a preferred embodiment of the present invention, the device further comprises a judging unit connected to the homepage determining unit and configured to judge whether the homepage to be processed has description information; if so, the description information is used directly as the summary of the homepage to be processed; otherwise, the homepage determining unit is triggered.
According to a preferred embodiment of the present invention, the homepage determining unit specifically comprises: a website determining subunit, configured to determine, according to a preset website classification table, candidate websites that belong to the same category as the homepage to be processed; and a selecting subunit, configured to obtain, from the homepages of the candidate websites, a plurality of homepages that have description information.
According to a preferred embodiment of the present invention, the website classification table is obtained by extracting classified websites from navigation classification information on the Internet; or it is obtained by classifying the websites of the clicked pages recorded in a search log, where the classification strategy treats the websites of the different pages clicked for the same query as one category.
According to a preferred embodiment of the present invention, the selecting subunit determines the homepages of the candidate websites by: querying a preset mapping table between websites and homepages to obtain the homepage of each candidate website; or, for each candidate website, using the name of the candidate website as a query to obtain the search results returned by a search engine, and extracting from those results the page that satisfies the homepage features as the homepage of that candidate website.
According to a preferred embodiment of the present invention, the homepage features specifically comprise: the URL of the page contains only a domain name, and the page contains verification information corresponding to the name of the candidate website, the verification information comprising text or an image.
According to a preferred embodiment of the present invention, the template generation unit extracts the summary template by: comparing the description information of the plurality of homepages, and abstracting the parts that occupy the same position in the description information of the plurality of homepages but differ in content into template slots, to obtain the summary template.
According to a preferred embodiment of the present invention, the keyword extracting unit extracts anchor texts from the homepage to be processed as the keywords.
According to a preferred embodiment of the present invention, the slots of the summary template comprise a website name slot and a navigation theme slot; the keyword extracting unit extracts keywords by: extracting the website name of the homepage to be processed and filling it into the website name slot of the summary template, and extracting the anchor texts of the homepage to be processed that have navigation features and filling them into the navigation theme slot of the summary template.
As can be seen from the above technical solutions, by extracting a summary template from the homepages of websites that are of the same category as the homepage to be processed and that have description information, and by combining that template with keywords from the homepage to be processed, a well-structured summary can be generated automatically for a homepage whose meta tag lacks description information. Compared with the prior art, this greatly improves the accuracy with which the summary describes the theme of the homepage.
[Brief Description of the Drawings]
Fig. 1 is a schematic diagram of the summary information provided for web pages in the search results of a search engine;
Fig. 2 is a schematic diagram of description information contained in a meta tag of a web page source file;
Fig. 3 shows a summary extracted with the prior art from a homepage lacking description information;
Fig. 4 is a flowchart of the method for automatically generating a homepage summary in the present invention;
Fig. 5 is a schematic diagram of a URL in the present invention;
Fig. 6 is a schematic diagram of anchor texts having navigation features on a web page in the present invention;
Fig. 7 is a schematic diagram of anchor texts having navigation features in the source file of a web page in the present invention.
[Detailed Description]
To make the purpose, technical solutions, and advantages of the present invention clearer, the present invention is described below with reference to the drawings and specific embodiments.
Please refer to Fig. 4, which is a flowchart of the method for automatically generating a homepage summary in the present invention. As shown in Fig. 4, the method comprises:
S101: determining a plurality of homepages that have description information and belong to websites of the same category as the homepage to be processed.
S102: extracting a summary template from the description information of the determined homepages.
S103: extracting keywords from the homepage to be processed and filling them into the corresponding slots of the summary template to obtain the summary of the homepage to be processed.
These steps are described in detail below.
Specifically, step S101 comprises:
Step S1011: determining, according to a preset website classification table, candidate websites that belong to the same category as the homepage to be processed.
Step S1012: obtaining, from the homepages of the candidate websites, a plurality of homepages that have description information.
The website classification table assigns each website to a category. Given a known website, the other websites of the same category can be found by looking it up in the table.
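The lookup described above can be sketched as follows. This is a minimal illustration, assuming the classification table is held in memory as a mapping from category name to a list of site host names; the function name `candidate_sites` and the table layout are hypothetical and are not specified by the patent.

```python
def candidate_sites(site, category_table):
    """Return the other sites that share at least one category with `site`.

    `category_table` is assumed to map a category name to a list of site
    host names (an illustrative layout, not one given in the source).
    """
    peers = set()
    for sites in category_table.values():
        if site in sites:
            peers.update(sites)
    peers.discard(site)  # the site itself is not a candidate
    return sorted(peers)
```

A site that appears in no category simply yields an empty candidate list, in which case no template can be derived for it.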
The website classification table in the present invention can be obtained by extracting classified websites from navigation classification information on the Internet. For example, some navigation websites on the Internet classify Internet resources into categories such as news, military, sports, finance, community, dating, group buying, mobile phones, digital products, fashion, novels, Q&A, and games, and list under each category the websites that belong to it. Extracting these classified websites yields the website classification table.
Alternatively, the website classification table can be obtained by mining a search log. A search log records each query that users issued and the pages clicked for each query. By treating the websites of the different pages clicked for the same query as one category, the websites of the clicked pages recorded in the search log can be classified. For example, if a user's query in the search log is "news" and the three pages "Sina News", "Sohu News", and "Tencent News" were clicked in the search results, the websites of these three pages can be treated as one category.
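The log-based classification strategy can be sketched as below. This is an illustration under stated assumptions: the log is taken to be an iterable of `(query, clicked_url)` pairs, sites are identified by host name, and queries that led to only one site are dropped because they define no group of peers. None of these details come from the patent.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def classify_sites_from_log(click_log):
    """Group the sites of clicked pages by the query that led to them.

    `click_log` is assumed to be an iterable of (query, clicked_url) pairs.
    """
    sites_by_query = defaultdict(set)
    for query, url in click_log:
        host = urlsplit(url).hostname
        if host:
            sites_by_query[query].add(host)
    # Assumption: a class needs at least two distinct sites to be useful.
    return {q: sorted(hosts) for q, hosts in sites_by_query.items() if len(hosts) > 1}
```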
Once the candidate websites belonging to the same category as the website of the homepage to be processed have been obtained, step S1012 can be performed.
In step S1012, the homepage of each candidate website must first be obtained. This can be done as follows:
A pre-established mapping table between websites and homepages is queried to obtain the homepage of each candidate website. For a candidate website whose homepage cannot be obtained from the mapping table, the name of the candidate website can instead be used as a query to obtain the search results returned by a search engine, and the page that satisfies the homepage features is extracted from those results as the homepage of the candidate website. For example, if the candidate website is Qunar, the search engine returns several results, and the page among them that satisfies the homepage features is the homepage of Qunar. The homepage features are: the URL of the page contains only a domain name, and the page contains verification information corresponding to the name of the candidate website.
A complete URL usually consists of a domain name (including the main domain), a directory, and parameters. Please refer to Fig. 5, which is a schematic diagram of a URL in the present invention; the directory and the parameters are not mandatory components of a URL. The URL of a homepage should contain only a domain name, with no directory or parameters. In addition, a homepage should contain verification information corresponding to the name of the candidate website, which can be text or an image. For example, the Baidu homepage contains the word "Baidu" and the corresponding logo. It should be understood that verification information is not limited to these; any mark that can be associated with the name of the candidate website can serve as verification information.
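The two homepage features can be checked roughly as follows. This is a minimal sketch: the function names are hypothetical, only textual verification information is handled (matching a logo image is out of scope here), and a bare `/` path is treated as "no directory".

```python
from urllib.parse import urlsplit


def url_is_bare_domain(url):
    """True if the URL has a host but no directory and no parameters."""
    parts = urlsplit(url)
    return bool(parts.hostname) and parts.path in ("", "/") and not parts.query


def looks_like_homepage(url, page_text, site_name):
    """Apply both homepage features: bare-domain URL plus textual verification.

    Assumption: the verification information is the site name appearing
    somewhere in the page text.
    """
    return url_is_bare_domain(url) and site_name in page_text
```

For instance, `url_is_bare_domain("http://news.baidu.com/guonei")` is false because the URL carries a directory, whereas a URL consisting only of scheme and host passes.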
After the homepage of each candidate website has been obtained, these pages are filtered by checking whether the meta tag in each page's source file contains description information, and a plurality of homepages having description information are selected from them so that the summary template can be extracted in step S102.
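Checking a page's source file for description information can be sketched with Python's standard `html.parser` module. The sketch assumes the description information is carried in a `<meta name="description" content="...">` tag, which is the common convention; the helper name `meta_description` is illustrative.

```python
from html.parser import HTMLParser


class _MetaDescriptionParser(HTMLParser):
    """Remembers the first non-empty <meta name="description"> content."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.description is not None:
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "description":
            content = (attr.get("content") or "").strip()
            if content:
                self.description = content


def meta_description(html):
    """Return the page's description information, or None if it has none."""
    parser = _MetaDescriptionParser()
    parser.feed(html)
    return parser.description
```

Candidate homepages for which `meta_description` returns `None` would be discarded before template extraction.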
A template is a matching tool composed of generalizable slots and fixed text, in which each slot describes the type of content it can be generalized to. For example, in the summary template "[website name] News, news center, a professional current-affairs reporting portal covering [navigation theme]", [website name] and [navigation theme] are template slots and the rest is fixed text; "website name" and "navigation theme" describe the type of content that can fill each slot.
Specifically, extracting the summary template from the description information of the determined homepages comprises: comparing the description information of the selected homepages, and abstracting the parts that occupy the same position in the description information of these homepages but differ in content into template slots.
Consider, for example, the following two pieces of page description information:
1. Tencent News, news center, a professional current-affairs reporting portal covering political news, domestic news, world news, social news, current-affairs commentary, news pictures, news features, news forums, military affairs, history
2. NetEase News, news center, a professional current-affairs reporting portal covering news, sports, entertainment, finance, technology, real estate
Comparing these two pieces of description information yields the template "[website name] News, news center, a professional current-affairs reporting portal covering [navigation theme]".
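One possible way to perform this comparison is sketched below using `difflib.SequenceMatcher` from the Python standard library. This is a swapped-in technique chosen for illustration, not the matching method of the patent. The sketch makes several assumptions: descriptions are tokenized on whitespace, a shared run shorter than `min_fixed_tokens` tokens is treated as coincidental and absorbed into the surrounding slot, and slots receive generic names (`[slot1]`, `[slot2]`) because assigning names such as "website name" is a separate step.

```python
from difflib import SequenceMatcher


def extract_template(desc_a, desc_b, min_fixed_tokens=2):
    """Derive a template from two descriptions: shared runs stay, the rest become slots."""
    a, b = desc_a.split(), desc_b.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    parts = []  # fixed text as str, a slot as None
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal" and i2 - i1 >= min_fixed_tokens:
            parts.append(" ".join(a[i1:i2]))
        elif not parts or parts[-1] is not None:
            parts.append(None)  # open a new slot; adjacent differences merge into it
    rendered, slot_no = [], 0
    for part in parts:
        if part is None:
            slot_no += 1
            rendered.append(f"[slot{slot_no}]")
        else:
            rendered.append(part)
    return " ".join(rendered)
```

The `min_fixed_tokens` threshold matters when the variable parts happen to share a word, such as "news," appearing in both topic lists; without it the navigation slot would be split around that word.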
Abstracting the parts that occupy the same position in several pieces of description information but differ in content into template slots can be implemented with any existing technique, for example vocabulary-based matching. It should also be understood that the way of extracting the summary template is not limited to the one described above. For the plurality of homepages in step S102, a template can instead be extracted from each homepage according to a vocabulary and the extracted templates then merged. Because a person skilled in the art can readily implement these approaches, they are not described in detail here.
After the summary template has been obtained, step S103 is performed to obtain the summary of the homepage to be processed.
In step S103, the anchor texts in the homepage to be processed can be extracted as its keywords. An anchor text is the text of a hyperlink in the source file of a web page. Specifically, for the summary template described above, which contains a website name slot and a navigation theme slot, step S103 comprises: extracting the website name of the homepage to be processed and filling it into the website name slot of the summary template, and extracting the anchor texts of the homepage to be processed that have navigation features and filling them into the navigation theme slot of the summary template. The navigation features are:
In the source file of the web page, the anchor texts are located in one contiguous DIV block, the link address that each anchor text points to contains the same main domain as the link address of the homepage to be processed, and the average length of the anchor texts falls within a set range.
The navigation features were determined by analyzing the navigation information of a variety of web pages. Visually, anchor texts having navigation features are arranged in a single coherent region rather than scattered across the page; please refer to Fig. 6, which is a schematic diagram of anchor texts having navigation features in the present invention. Accordingly, in the source file of the web page, anchor texts having navigation features should be located in a contiguous DIV block; please refer to Fig. 7, which is a schematic diagram of anchor texts having navigation features in the source file of a web page in the present invention. As shown in Fig. 7, the anchor texts "news", "finance", "technology", "sports", and so on are located in a contiguous DIV block. In addition, the link address that the anchor text "news" points to (news.sina.com.cn) has the same main domain (sina.com.cn) as the link address of the homepage (www.sina.com.cn), and the same holds for the other anchor texts. Furthermore, unlike other anchor texts in the source file, anchor texts having navigation features usually have an average length of 2 to 4 characters, which provides a further basis for distinguishing them from other anchor texts.
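The three navigation features can be combined into a rough filter, sketched below with the standard `html.parser` module. Several points are assumptions made for illustration: anchors are grouped by their innermost enclosing `<div>`, the main domain is passed in by the caller rather than derived, relative links are resolved against the homepage URL, a group needs at least two anchors, and length is measured in characters with the 2 to 4 range as the default.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class _AnchorGrouper(HTMLParser):
    """Collects (href, text) pairs grouped by innermost enclosing <div>."""

    def __init__(self):
        super().__init__()
        self.groups = []      # one list of (href, text) per <div>, in document order
        self._open_divs = []  # stack of indices into self.groups
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.groups.append([])
            self._open_divs.append(len(self.groups) - 1)
        elif tag == "a" and self._open_divs:
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            if text and self._open_divs:
                self.groups[self._open_divs[-1]].append((self._href, text))
            self._href = None
        elif tag == "div" and self._open_divs:
            self._open_divs.pop()


def navigation_anchor_texts(html, homepage_url, main_domain, min_avg_len=2, max_avg_len=4):
    """Return the anchor texts of the first <div> that satisfies all three features."""
    parser = _AnchorGrouper()
    parser.feed(html)
    for anchors in parser.groups:
        if len(anchors) < 2:  # assumption: a navigation bar has several links
            continue
        hosts = [urlsplit(urljoin(homepage_url, href)).hostname or "" for href, _ in anchors]
        same_site = all(h == main_domain or h.endswith("." + main_domain) for h in hosts)
        avg_len = sum(len(text) for _, text in anchors) / len(anchors)
        if same_site and min_avg_len <= avg_len <= max_avg_len:
            return [text for _, text in anchors]
    return []
```

Deriving the main domain from a host name in general requires knowledge of public suffixes, which is why this sketch takes it as a parameter.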
After the keywords have been extracted from the homepage to be processed, they are filled into the summary template obtained in step S102 to generate the summary of the homepage to be processed.
For example, for the summary template "a professional current-affairs reporting portal covering [navigation theme]", suppose the keywords extracted from a website homepage are "news, sports, entertainment, finance, technology, real estate". Filling these keywords into the navigation theme slot yields the following homepage summary: "a professional current-affairs reporting portal covering news, sports, entertainment, finance, technology, real estate".
As another example, for the summary template "[website name] News, news center, a professional current-affairs reporting portal covering [navigation theme]", suppose the website name extracted from the homepage is "Sohu"; it is filled into the "website name" slot. The anchor texts having navigation features extracted from the homepage are "interview, blog, comment, forum, click today, slowly draw slowly live, digital road, the large visual field, news headlines, vision alliance, news flash, news review"; they are filled into the "navigation theme" slot. The resulting summary of the Sohu homepage is "Sohu News, news center, a professional current-affairs reporting portal covering interview, blog, comment, forum, click today, slowly draw slowly live, digital road, the large visual field, news headlines, vision alliance, news flash, news review".
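Slot filling as in the examples above can be sketched in a few lines. The sketch assumes slots are written as bracketed names inside the template string and that a list of keywords is joined with a comma separator; the function name `fill_template` is illustrative.

```python
import re


def fill_template(template, slot_values, separator=", "):
    """Replace each [slot name] in `template` with its value.

    A value may be a string or a list of keywords; lists are joined with
    `separator`. A slot without a value raises KeyError.
    """
    def replace(match):
        value = slot_values[match.group(1)]
        return value if isinstance(value, str) else separator.join(value)

    return re.sub(r"\[([^\[\]]+)\]", replace, template)
```

Raising on a missing slot value, rather than leaving the bracketed name in place, keeps an incomplete summary from being shown to users.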
Please refer to Fig. 8, which is a structural block diagram of the device for automatically generating a homepage summary in the present invention. As shown in Fig. 8, the device comprises: a judging unit 201, a homepage determining unit 202, a template generation unit 203, and a keyword extracting unit 204.
The judging unit 201 is configured to judge whether the homepage to be processed has description information; if so, the description information is used as the summary of the homepage to be processed; otherwise, the homepage determining unit 202 is triggered.
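The control flow of the judging unit can be sketched as a small dispatcher. The two callables are placeholders supplied by the caller and are hypothetical: `read_description` stands for whatever reads description information from the page, and `generate_from_peers` stands for the template-based path carried out by units 202 to 204.

```python
from typing import Callable, Optional


def summarize_homepage(
    homepage_html: str,
    read_description: Callable[[str], Optional[str]],
    generate_from_peers: Callable[[str], str],
) -> str:
    """Use the page's own description if present; otherwise generate a summary."""
    description = read_description(homepage_html)
    if description:
        return description
    return generate_from_peers(homepage_html)
```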
The homepage determining unit 202 is configured to determine a plurality of homepages that have description information and belong to websites of the same category as the homepage to be processed. Specifically, the homepage determining unit 202 comprises a website determining subunit 2021 and a selecting subunit 2022.
The website determining subunit 2021 is configured to determine, according to a preset website classification table, candidate websites that belong to the same category as the homepage to be processed. The selecting subunit 2022 is configured to obtain, from the homepages of the candidate websites, a plurality of homepages that have description information.
Specifically, the website classification table used by the website determining subunit 2021 can be obtained by extracting classified websites from navigation classification information on the Internet, or by classifying the websites of the clicked pages recorded in a search log, where the classification strategy treats the websites of the different pages clicked for the same query as one category.
Specifically, the selecting subunit 2022 obtains the homepages of the candidate websites by:
Querying a preset mapping table between websites and homepages to obtain the homepage of each candidate website; or, for each candidate website, using the name of the candidate website as a query to obtain the search results returned by a search engine, and extracting from those results the page that satisfies the homepage features as the homepage of that candidate website. The homepage features are: the URL of the page contains only a domain name, and the page contains verification information corresponding to the name of the candidate website, the verification information comprising text or an image.
The template generation unit 203 is configured to extract a summary template from the description information of the determined homepages.
Specifically, the template generation unit 203 extracts the summary template by:
Comparing the description information of the obtained homepages, and abstracting the parts that occupy the same position in the description information of these homepages but differ in content into template slots, to obtain the summary template.
The keyword extracting unit 204 is configured to extract keywords from the homepage to be processed and fill them into the corresponding slots of the summary template to obtain the summary of the homepage to be processed.
Specifically, the keyword extracting unit 204 can extract the anchor texts in the homepage to be processed as the keywords.
In a preferred embodiment, the slots of the summary template obtained by the template generation unit 203 comprise a website name slot and a navigation theme slot. The keyword extracting unit 204 extracts keywords by: extracting the website name of the homepage to be processed and filling it into the website name slot of the summary template, and extracting the anchor texts of the homepage to be processed that have navigation features and filling them into the navigation theme slot of the summary template.
The navigation features are: in the source file of the web page, the anchor texts are located in one contiguous DIV block, the link address that each anchor text points to contains the same main domain as the link address of the homepage to be processed, and the average length of the anchor texts falls within a set range.
It should be understood that the device structure shown in Fig. 8 is a preferred implementation; the judging unit 201 is not an essential feature of the present invention and does not limit the device of the present invention.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (18)

1. one kind generates the method that homepage is made a summary automatically, comprising:
A, determine that a plurality of and pending homepage belongs to same classification website and has the homepage of descriptor;
The descriptor of a plurality of homepages that B, utilization are determined extracts the summary template;
C, from described pending homepage, extract keyword and be filled to corresponding groove position in the described summary template, obtain the summary of described pending homepage.
2. method according to claim 1 is characterized in that, also comprises before described steps A:
Judge whether described pending homepage exists descriptor, if so, then directly with the summary of described descriptor as described pending homepage; Otherwise, carry out described steps A.
3. The method according to claim 1, characterized in that step A specifically comprises:
A1. determining, according to a preset website category table, candidate websites that belong to the same category as the pending homepage;
A2. obtaining a plurality of homepages having descriptors from the homepages corresponding to the candidate websites.
4. The method according to claim 3, characterized in that the website category table is obtained by extracting categorized websites from navigation classification information on the Internet; or is obtained by classifying the websites corresponding to the clicked pages recorded in a search log, wherein the classification strategy treats the websites corresponding to the different pages clicked for the same query as one category.
5. The method according to claim 3, characterized in that determining the homepages corresponding to the candidate websites in step A2 specifically comprises:
querying a preset mapping table between websites and homepages to obtain the homepage corresponding to each candidate website; or, for each candidate website, using the name of the candidate website as a query keyword to obtain the search results returned by a search engine, and extracting from the search results a page that satisfies the homepage feature as the homepage corresponding to that candidate website.
6. The method according to claim 5, characterized in that the homepage feature specifically comprises:
the URL of the page contains only a domain name, and the page contains verification information corresponding to the name of the candidate website, the verification information comprising text or a graphic.
7. The method according to claim 1, characterized in that step B specifically comprises:
comparing the descriptors of the plurality of homepages, and abstracting the parts that occupy the same corresponding position in the descriptors but differ in content into template slots, to obtain the summary template.
8. The method according to claim 1, characterized in that, in step C, anchor text words are extracted from the pending homepage as the keywords.
9. The method according to claim 1, characterized in that the slots of the summary template comprise a website name and a navigation theme;
step C specifically comprises: extracting the website name of the pending homepage and filling it into the website-name slot of the summary template, and extracting the anchor text words of the pending homepage that have a navigation feature and filling them into the navigation-theme slot of the summary template.
10. A device for automatically generating a homepage summary, comprising:
a homepage determining unit, configured to determine a plurality of homepages that belong to websites of the same category as a pending homepage and that have descriptors;
a template generation unit, configured to extract a summary template using the descriptors of the determined plurality of homepages;
a keyword extracting unit, configured to extract keywords from the pending homepage and fill them into the corresponding slots in the summary template, to obtain the summary of the pending homepage.
11. The device according to claim 10, characterized in that the device further comprises a judging unit connected to the homepage determining unit and configured to judge whether the pending homepage has a descriptor; if so, the descriptor is directly used as the summary of the pending homepage; otherwise, the homepage determining unit is triggered to execute.
12. The device according to claim 10, characterized in that the homepage determining unit specifically comprises:
a website determining subunit, configured to determine, according to a preset website category table, candidate websites that belong to the same category as the pending homepage;
a selecting subunit, configured to obtain a plurality of homepages having descriptors from the homepages corresponding to the candidate websites.
13. The device according to claim 12, characterized in that the website category table is obtained by extracting categorized websites from navigation classification information on the Internet; or is obtained by classifying the websites corresponding to the clicked pages recorded in a search log, wherein the classification strategy treats the websites corresponding to the different pages clicked for the same query as one category.
14. The device according to claim 12, characterized in that the selecting subunit determines the homepages corresponding to the candidate websites by:
querying a preset mapping table between websites and homepages to obtain the homepage corresponding to each candidate website; or, for each candidate website, using the name of the candidate website as a query keyword to obtain the search results returned by a search engine, and extracting from the search results a page that satisfies the homepage feature as the homepage corresponding to that candidate website.
15. The device according to claim 14, characterized in that the homepage feature specifically comprises:
the URL of the page contains only a domain name, and the page contains verification information corresponding to the name of the candidate website, the verification information comprising text or a graphic.
16. The device according to claim 10, characterized in that the template generation unit extracts the summary template by:
comparing the descriptors of the plurality of homepages, and abstracting the parts that occupy the same corresponding position in the descriptors but differ in content into template slots, to obtain the summary template.
17. The device according to claim 10, characterized in that the keyword extracting unit extracts anchor text words from the pending homepage as the keywords.
18. The device according to claim 10, characterized in that the slots of the summary template comprise a website name and a navigation theme;
the keyword extracting unit extracts keywords by: extracting the website name of the pending homepage and filling it into the website-name slot of the summary template, and extracting the anchor text words of the pending homepage that have a navigation feature and filling them into the navigation-theme slot of the summary template.
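The template extraction and slot filling recited in the claims above can be illustrated with a minimal sketch. It assumes that the descriptors are split on whitespace and line up token by token, which is a simplification; the claims do not specify an alignment method, and Chinese text would first require word segmentation. The names `extract_template`, `fill_template` and `SLOT`, and the convention that the first slot holds the website name and the second the navigation theme, are hypothetical.

```python
SLOT = None  # marker for a slot position in the template


def extract_template(descriptions):
    """Compare descriptors position by position; differing positions become slots."""
    token_lists = [d.split() for d in descriptions]
    # Simplifying assumption: all descriptors align position by position.
    if len({len(tokens) for tokens in token_lists}) != 1:
        raise ValueError("descriptors must have the same number of tokens")
    template = []
    for column in zip(*token_lists):
        if len(set(column)) == 1:
            template.append(column[0])   # same content everywhere: fixed text
        elif not template or template[-1] is not SLOT:
            template.append(SLOT)        # differing content: open a new slot
        # otherwise the previous position was already a slot, so adjacent
        # differing tokens are merged into that single slot
    return template


def fill_template(template, values):
    """Fill the slots of the template, in order, with the given values."""
    values = list(values)
    if sum(1 for part in template if part is SLOT) != len(values):
        raise ValueError("number of values must match number of slots")
    parts = []
    for part in template:
        parts.append(values.pop(0) if part is SLOT else part)
    return " ".join(parts)


descriptors = [
    "AlphaBooks official site: Fiction Comics and more",
    "BetaReads official site: Poetry Travel and more",
]
template = extract_template(descriptors)
summary = fill_template(template, ["GammaPages", "History Science"])
```

Here the first value stands in for the extracted website name and the second for the navigation-theme words taken from anchor text; how the slots are labelled in practice is left open by this sketch.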
CN2012100754141A 2012-03-21 2012-03-21 Method and device for automatic generating of front page abstract Pending CN103324622A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012100754141A CN103324622A (en) 2012-03-21 2012-03-21 Method and device for automatic generating of front page abstract

Publications (1)

Publication Number Publication Date
CN103324622A (en) 2013-09-25

Family

ID=49193370

Country Status (1)

Country Link
CN (1) CN103324622A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20130925