CN103324622A - Method and device for automatic generating of front page abstract - Google Patents
- Publication number
- CN103324622A CN2012100754141A CN201210075414A
- Authority
- CN
- China
- Prior art keywords
- homepage
- website
- descriptor
- pending
- template
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention provides a method and a device for automatically generating a homepage summary. The method comprises the following steps: A, determining a plurality of homepages that have description information and belong to websites of the same category as a to-be-processed homepage; B, extracting a summary template from the description information of the determined homepages; C, extracting keywords from the to-be-processed homepage and filling them into the corresponding slots of the summary template to obtain the summary of the to-be-processed homepage. The method can improve the accuracy with which the summary describes the theme of a homepage.
Description
[technical field]
The present invention relates to natural language processing technology, and in particular to a method and a device for automatically generating a homepage summary.
[background technology]
When a search engine returns search results to a user, it usually provides, below each result link, summary information about the page the link points to, in addition to the link itself, to help the user quickly grasp the main content of that page. Please refer to Fig. 1, which is a schematic diagram of summary information provided for the corresponding web pages in a search engine's results. To provide the summary information shown in Fig. 1, the search engine first needs to extract a summary from the page. For an ordinary web page whose source file provides no description information in its meta tag, the search engine can extract keywords through semantic analysis of the page content and use them as the summary of the page; for a web page whose source file does provide description information in its meta tag, the search engine can use that description information as the summary of the page. Please refer to Fig. 2, which is a schematic diagram of description information contained in the meta tag of a web page's source file.
The homepage of a website is the default page shown when the website is opened. A homepage usually serves a navigation role, so the information it contains is rather cluttered, and semantic analysis of a homepage rarely yields an accurate theme. As a result, when the extraction approach described above is applied to a homepage whose source file lacks description information in its meta tag, the resulting summary is cluttered, which hurts its accuracy. Please refer to Fig. 3, which shows a summary extracted by the prior art from a homepage lacking description information. Compared with the description information shown in Fig. 2, the summary shown in Fig. 3 lacks coherent semantic logic and describes the theme of the homepage poorly.
[summary of the invention]
The technical problem to be solved by the present invention is to provide a method and a device for automatically generating a homepage summary, so as to overcome the defect of the prior art that the automatically generated summary is inaccurate when the meta tag of a homepage lacks description information.
To solve this technical problem, the present invention provides a method for automatically generating a homepage summary, comprising: A, determining a plurality of homepages that belong to websites of the same category as a to-be-processed homepage and that have description information; B, extracting a summary template from the description information of the determined homepages; C, extracting keywords from the to-be-processed homepage and filling them into the corresponding slots of the summary template to obtain the summary of the to-be-processed homepage.
According to a preferred embodiment of the present invention, the method further comprises, before step A: judging whether the to-be-processed homepage has description information; if so, directly using the description information as the summary of the to-be-processed homepage; otherwise, performing step A.
According to a preferred embodiment of the present invention, step A specifically comprises: A1, determining, according to a preset website category table, candidate websites that belong to the same category as the to-be-processed homepage; A2, obtaining a plurality of homepages having description information from the homepages corresponding to the candidate websites.
According to a preferred embodiment of the present invention, the website category table is obtained by extracting categorized websites from navigation classification information on the Internet; or it is obtained by classifying the websites corresponding to the clicked pages recorded in a search log, where the classification strategy is to treat the websites corresponding to the different pages clicked for the same query as one category.
According to a preferred embodiment of the present invention, the step in A2 of determining the homepages corresponding to the candidate websites specifically comprises: querying a preset mapping table between websites and homepages to obtain the homepage corresponding to each candidate website; or, for each candidate website, using the name of the candidate website as a query to obtain the search results returned by a search engine, and extracting from the search results the page that satisfies the homepage features as the homepage corresponding to the candidate website.
According to a preferred embodiment of the present invention, the homepage features specifically comprise: the URL of the page contains only a domain name, and the page contains verification information corresponding to the name of the candidate website, the verification information comprising text or a graphic.
According to a preferred embodiment of the present invention, step B specifically comprises: comparing the description information of the plurality of homepages, and abstracting the parts that occupy the same position but differ in content into template slots, thereby obtaining the summary template.
According to a preferred embodiment of the present invention, in step C, anchor text words are extracted from the to-be-processed homepage as the keywords.
According to a preferred embodiment of the present invention, the slots of the summary template comprise a website name slot and a navigation topic slot; step C specifically comprises: extracting the website name of the to-be-processed homepage and filling it into the website name slot of the summary template, and extracting the anchor text words of the to-be-processed homepage that have navigation features and filling them into the navigation topic slot of the summary template.
The present invention also provides a device for automatically generating a homepage summary, comprising: a homepage determining unit, configured to determine a plurality of homepages that belong to websites of the same category as a to-be-processed homepage and that have description information; a template generating unit, configured to extract a summary template from the description information of the determined homepages; and a keyword extracting unit, configured to extract keywords from the to-be-processed homepage and fill them into the corresponding slots of the summary template to obtain the summary of the to-be-processed homepage.
According to a preferred embodiment of the present invention, the device further comprises a judging unit connected to the homepage determining unit and configured to judge whether the to-be-processed homepage has description information; if so, the description information is used directly as the summary of the to-be-processed homepage; otherwise, the homepage determining unit is triggered.
According to a preferred embodiment of the present invention, the homepage determining unit specifically comprises: a website determining subunit, configured to determine, according to a preset website category table, candidate websites that belong to the same category as the to-be-processed homepage; and a selecting subunit, configured to obtain a plurality of homepages having description information from the homepages corresponding to the candidate websites.
According to a preferred embodiment of the present invention, the website category table is obtained by extracting categorized websites from navigation classification information on the Internet; or it is obtained by classifying the websites corresponding to the clicked pages recorded in a search log, where the classification strategy is to treat the websites corresponding to the different pages clicked for the same query as one category.
According to a preferred embodiment of the present invention, the selecting subunit determines the homepages corresponding to the candidate websites by: querying a preset mapping table between websites and homepages to obtain the homepage corresponding to each candidate website; or, for each candidate website, using the name of the candidate website as a query to obtain the search results returned by a search engine, and extracting from the search results the page that satisfies the homepage features as the homepage corresponding to the candidate website.
According to a preferred embodiment of the present invention, the homepage features specifically comprise: the URL of the page contains only a domain name, and the page contains verification information corresponding to the name of the candidate website, the verification information comprising text or a graphic.
According to a preferred embodiment of the present invention, the template generating unit extracts the summary template by: comparing the description information of the plurality of homepages, and abstracting the parts that occupy the same position but differ in content into template slots, thereby obtaining the summary template.
According to a preferred embodiment of the present invention, the keyword extracting unit extracts anchor text words from the to-be-processed homepage as the keywords.
According to a preferred embodiment of the present invention, the slots of the summary template comprise a website name slot and a navigation topic slot; the keyword extracting unit extracts keywords by: extracting the website name of the to-be-processed homepage and filling it into the website name slot of the summary template, and extracting the anchor text words of the to-be-processed homepage that have navigation features and filling them into the navigation topic slot of the summary template.
As can be seen from the above technical solutions, by extracting a summary template from homepages of websites that are in the same category as the to-be-processed homepage and that have description information, and by combining the summary template with keywords from the to-be-processed homepage, a well-structured summary can be generated automatically for a homepage whose meta tag lacks description information. Compared with the prior art, this greatly improves the accuracy with which the summary describes the theme of the homepage.
[description of drawings]
Fig. 1 is a schematic diagram of summary information provided for the corresponding web pages in a search engine's results;
Fig. 2 is a schematic diagram of description information contained in the meta tag of a web page's source file;
Fig. 3 shows a summary extracted by the prior art from a homepage lacking description information;
Fig. 4 is a flowchart of the method for automatically generating a homepage summary according to the present invention;
Fig. 5 is a schematic diagram of a URL according to the present invention;
Fig. 6 is a schematic diagram of anchor text words having navigation features on a web page according to the present invention;
Fig. 7 is a schematic diagram of anchor text words having navigation features in the source file of a web page according to the present invention.
[embodiment]
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described below with reference to the drawings and specific embodiments.
Please refer to Fig. 4, which is a flowchart of the method for automatically generating a homepage summary according to the present invention. As shown in Fig. 4, the method comprises:
S101: determine a plurality of homepages that belong to websites of the same category as the to-be-processed homepage and that have description information.
S102: extract a summary template from the description information of the determined homepages.
S103: extract keywords from the to-be-processed homepage and fill them into the corresponding slots of the summary template to obtain the summary of the to-be-processed homepage.
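The flow of S101 to S103, together with the optional pre-check for existing description information, can be sketched as a small driver function. This is only an illustration of the call order: every callable passed in (`get_description`, `find_peer_homepages`, `extract_template`, `extract_keywords`, `fill_template`) is a hypothetical stand-in, not something defined by the patent.

```python
def generate_summary(page, get_description, find_peer_homepages,
                     extract_template, extract_keywords, fill_template):
    """Return a summary for `page`; every callable argument is a hypothetical stand-in."""
    existing = get_description(page)
    if existing:
        # Optional pre-check: a homepage that already has description
        # information uses it directly as its summary.
        return existing
    peers = find_peer_homepages(page)                                  # S101
    template = extract_template([get_description(p) for p in peers])  # S102
    return fill_template(template, extract_keywords(page))            # S103


# Toy stand-ins, only to show the call order.
descriptions = {"peer-a": "A portal covering x", "peer-b": "A portal covering y"}
summary = generate_summary(
    "target",
    get_description=descriptions.get,
    find_peer_homepages=lambda page: ["peer-a", "peer-b"],
    extract_template=lambda descs: "A portal covering [navigation topic]",
    extract_keywords=lambda page: "news, sports",
    fill_template=lambda tpl, words: tpl.replace("[navigation topic]", words),
)
```

Passing the steps in as callables keeps the sketch independent of how each step is actually implemented.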
These steps are described in detail below.
Step S101 specifically comprises:
Step S1011: determine, according to a preset website category table, candidate websites that belong to the same category as the to-be-processed homepage.
Step S1012: obtain a plurality of homepages having description information from the homepages corresponding to the candidate websites.
The website category table classifies websites, so that, given a known website, the other websites in the same category can be found by looking it up in the table.
The website category table of the present invention can be obtained by extracting categorized websites from navigation classification information on the Internet. For example, some navigation websites on the Internet classify Internet resources into categories such as news, military, sports, finance, community, dating, group buying, mobile phones, digital, fashion, novels, Q&A, and games, and list under each category the websites that belong to it. Extracting these categorized websites yields the website category table.
Alternatively, the website category table can be obtained by mining the search log. The search log records each query that users issued and the pages that were clicked for each query. The websites corresponding to the clicked pages recorded in the search log can be classified by treating the websites corresponding to the different pages clicked for the same query as one category. For example, if the search log shows that a user entered the query "news" and then clicked the three pages "Sina News", "Sohu News", and "Tencent News" in the search results, the websites corresponding to these three pages can be treated as one category.
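The search-log strategy can be illustrated with a minimal sketch. It assumes the log has already been reduced to `(query, clicked_site)` pairs; that format, the function names, and the sample host names are illustrative assumptions, not part of the patent.

```python
from collections import defaultdict


def build_category_table(search_log):
    """Group clicked sites by the query that led to the clicks.

    `search_log` is an iterable of (query, clicked_site) pairs (assumed format).
    """
    table = defaultdict(set)
    for query, site in search_log:
        table[query].add(site)
    return dict(table)


def same_category_sites(table, site):
    """Return every other site that shares at least one category with `site`."""
    peers = set()
    for sites in table.values():
        if site in sites:
            peers |= sites
    peers.discard(site)
    return peers


# Illustrative log entries; the host names are examples only.
log = [
    ("news", "news.sina.com.cn"),
    ("news", "news.sohu.com"),
    ("news", "news.qq.com"),
    ("group buying", "tuan.example.com"),
]
table = build_category_table(log)
candidates = same_category_sites(table, "news.sohu.com")
```

Here each query acts as a category label, so a site clicked for several queries belongs to several categories.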
Once the candidate websites that belong to the same category as the website of the to-be-processed homepage have been obtained, step S1012 can be performed.
In step S1012, the homepage corresponding to each candidate website must first be obtained, which can be done as follows:
Query a pre-built mapping table between websites and homepages to obtain the homepage corresponding to each candidate website. Alternatively, for example for a candidate website whose homepage is not available from the mapping table, the name of the candidate website can be used as a query to obtain the search results returned by a search engine, and the page satisfying the homepage features can be extracted from the search results as the homepage of that candidate website. For example, if the candidate website is Qunar ("Where to Go"), several search results are returned, and the page among them that satisfies the homepage features is the homepage of Qunar. The homepage features comprise: the URL of the page contains only a domain name, and the page contains verification information corresponding to the name of the candidate website.
A complete URL usually consists of a domain name (including the main domain), a directory, and parameters. Please refer to Fig. 5, which is a schematic diagram of a URL according to the present invention; the directory and the parameters are not mandatory components of a URL. The URL of a homepage should contain only a domain name, with no directory or parameters. In addition, a homepage should contain verification information corresponding to the name of the candidate website, which can be text or a graphic. For example, the Baidu homepage contains the word "Baidu" and the corresponding logo. It should be understood that verification information is not limited to this; any mark that can be associated with the name of the candidate website can serve as verification information.
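The two homepage features can be sketched as follows. This is a simplified check, assuming that "only a domain name" means an empty path and query string and that the verification information is plain text; graphic verification such as a logo is outside its scope. The function names and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def url_is_domain_only(url):
    """True if the URL has a host but no directory (path) and no parameters (query)."""
    parts = urlsplit(url)
    # A URL without a scheme (e.g. "www.example.com") has an empty netloc
    # under urlsplit and is therefore rejected by this sketch.
    return bool(parts.netloc) and parts.path in ("", "/") and not parts.query


def looks_like_homepage(url, page_text, site_name):
    """Combine both features: domain-only URL and textual verification information."""
    return url_is_domain_only(url) and site_name in page_text


is_home = looks_like_homepage("http://www.example.com/", "Example - welcome", "Example")
is_inner = looks_like_homepage("http://www.example.com/s?wd=news", "Example", "Example")
```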
After the homepage corresponding to each candidate website has been obtained, these pages can be filtered by checking whether the meta tag in each page's source file contains description information, and a plurality of homepages having description information are selected from them so that the summary template can be extracted in step S102.
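Checking a page's source for description information can be sketched with the standard-library HTML parser. The sketch assumes the description lives in a `<meta name="description" content="...">` tag, which is the common convention; the class and function names are illustrative.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Remember the first non-empty <meta name="description"> content."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.description is not None:
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "description":
            content = (attr.get("content") or "").strip()
            if content:
                self.description = content


def get_description(html):
    """Return the page's description information, or None if it has none."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    parser.close()
    return parser.description


with_meta = '<html><head><meta name="description" content="A news portal"></head></html>'
without_meta = "<html><head><title>Home</title></head><body></body></html>"
selected = [page for page in (with_meta, without_meta) if get_description(page)]
```

Only pages for which `get_description` returns a value would be kept for template extraction.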
A template is a matching tool composed of generalizable slots and fixed text, where each slot is labeled with the type of content that can fill it. For example, in the summary template "[website name] News, news center, a professional current-affairs reporting portal covering [navigation topic]", [website name] and [navigation topic] are template slots and the remainder is fixed text; "website name" and "navigation topic" describe the type of content that can fill each slot.
Specifically, extracting the summary template from the description information of the determined homepages comprises: comparing the description information of the selected homepages, and abstracting the parts that occupy the same position but differ in content into template slots.
For example, consider the description information of the following two pages:
1. Tencent News, news center, a professional current-affairs reporting portal covering current politics, domestic news, international news, social news, current-affairs commentary, news pictures, news special topics, Usenet, military, history
2. NetEase News, news center, a professional current-affairs reporting portal covering news, sports, entertainment, finance, technology, real estate
Comparing these two pieces of description information yields the template "[website name] News, news center, a professional current-affairs reporting portal covering [navigation topic]".
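The patent leaves the comparison technique open, so the following is only one possible sketch: it aligns two descriptions word by word with `difflib.SequenceMatcher`, keeps sufficiently long identical runs as fixed text, and collapses everything else into a generic `[slot]` marker. The `min_fixed` threshold, the generic slot marker (rather than typed slots such as [website name]), and the shortened sample descriptions are all assumptions made for the illustration.

```python
from difflib import SequenceMatcher

SLOT = "[slot]"


def extract_template(desc_a, desc_b, min_fixed=2):
    """Abstract the differing parts of two descriptions into slots.

    Identical runs of at least `min_fixed` words are kept as fixed text;
    shorter coincidental matches are treated as slot content.
    """
    a, b = desc_a.split(), desc_b.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    parts = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal" and i2 - i1 >= min_fixed:
            parts.append(" ".join(a[i1:i2]))
        elif not parts or parts[-1] != SLOT:
            parts.append(SLOT)  # merge adjacent differing regions into one slot
    return " ".join(parts)


template = extract_template(
    "Tencent News, news center, a professional current-affairs portal covering politics, military, history",
    "NetEase News, news center, a professional current-affairs portal covering sports, finance, technology",
)
```

Naming each slot by content type would need extra knowledge, for example a vocabulary of website names, which this sketch does not attempt.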
The process of abstracting the parts that occupy the same position but differ in content into template slots can be implemented with any existing technique, for example vocabulary-based matching. It should also be understood that the way of extracting the summary template is not limited to the one described above: a template can also be extracted from each of the homepages of step S102 separately according to a vocabulary, and the extracted templates then merged. Since specific implementations can be devised by those skilled in the art, they are not described in detail here.
Once the summary template has been obtained, step S103 is performed to obtain the summary of the to-be-processed homepage.
In step S103, the anchor text words in the to-be-processed homepage can be extracted as its keywords. An anchor text word is the text of a hyperlink in the source file of a web page. Specifically, for the summary template described above, which contains a website name slot and a navigation topic slot, step S103 comprises: extracting the website name of the to-be-processed homepage and filling it into the website name slot of the summary template, and extracting the anchor text words of the to-be-processed homepage that have navigation features and filling them into the navigation topic slot of the summary template. Here, navigation features mean the following:
In the source file of the web page, the anchor text words are located in a contiguous DIV block, the link address that each anchor text word points to contains the same main domain as the link address of the to-be-processed homepage, and the average length of the anchor text words falls within a set range.
The navigation features were determined by analyzing the navigation information of various web pages. Visually, anchor text words having navigation features are arranged within a single visual region rather than scattered across the page; please refer to Fig. 6, which is a schematic diagram of anchor text words having navigation features on a web page according to the present invention. Accordingly, in the source file of a web page, anchor text words having navigation features should be located in a contiguous DIV block; please refer to Fig. 7, which is a schematic diagram of anchor text words having navigation features in the source file of a web page according to the present invention. As shown in Fig. 7, the anchor text words "news", "finance", "technology", "sports", and so on are located in a contiguous DIV block. In addition, the link address that the anchor text word "news" points to (news.sina.com.cn) has the same main domain (sina.com.cn) as the link address of the homepage (www.sina.com.cn), and the same holds for the other anchor text words. Furthermore, unlike the other anchor text words in the source file of the web page, anchor text words having navigation features usually have an average length of 2 to 4 characters, which provides a further basis for distinguishing them from other anchor text words.
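The three navigation features can be sketched with the standard-library HTML parser. Several simplifications are assumed here: "contiguous DIV block" is read as "anchors sharing the same innermost `<div>`", the main domain is supplied by the caller instead of being derived from the homepage URL, relative links are treated as not matching, and the largest qualifying group is returned. The default length range of 2 to 4 follows the text and presumably counts Chinese characters, so the English sample passes a wider `max_len`. All names and the sample markup are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class AnchorCollector(HTMLParser):
    """Collect (div_id, href, text) for every <a>, keyed by its innermost <div>."""

    def __init__(self):
        super().__init__()
        self.div_stack = []  # ids of the currently open <div> elements
        self.next_id = 0
        self.href = None     # href of the <a> being read, if any
        self.text = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.div_stack.append(self.next_id)
            self.next_id += 1
        elif tag == "a":
            self.href = dict(attrs).get("href") or ""
            self.text = []

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()
        elif tag == "a" and self.href is not None:
            div_id = self.div_stack[-1] if self.div_stack else None
            self.anchors.append((div_id, self.href, "".join(self.text).strip()))
            self.href = None

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)


def navigation_words(html, main_domain, min_len=2, max_len=4):
    """Return the anchor texts of the largest <div> group with navigation features."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    groups = {}
    for div_id, href, text in collector.anchors:
        groups.setdefault(div_id, []).append((href, text))
    best = []
    for items in groups.values():
        hosts = [urlsplit(href).hostname or "" for href, _ in items]
        same_domain = all(h == main_domain or h.endswith("." + main_domain) for h in hosts)
        words = [text for _, text in items if text]
        if not words or not same_domain:
            continue
        average = sum(len(w) for w in words) / len(words)
        if min_len <= average <= max_len and len(words) > len(best):
            best = words
    return best


page = """
<div><a href="http://news.sina.com.cn/">News</a>
<a href="http://finance.sina.com.cn/">Finance</a>
<a href="http://sports.sina.com.cn/">Sports</a></div>
<div><a href="http://ads.example.com/x">Click here for a special offer today</a></div>
"""
nav = navigation_words(page, "sina.com.cn", max_len=8)
```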
After the keywords have been extracted from the to-be-processed homepage, they are filled into the summary template obtained in step S102 to generate the summary of the to-be-processed homepage.
For example, for the summary template "a professional current-affairs reporting portal covering [navigation topic]", suppose the keywords extracted from a website's homepage are "news, sports, entertainment, finance, technology, real estate". Filling these keywords into the navigation topic slot yields the following homepage summary: "a professional current-affairs reporting portal covering news, sports, entertainment, finance, technology, real estate".
As another example, for the summary template "[website name] News, news center, a professional current-affairs reporting portal covering [navigation topic]", suppose the website name extracted from the homepage is "Sohu"; it is filled into the website name slot. Suppose the anchor text words having navigation features extracted from the homepage are "interview, blog, comment, forum, click today, slowly draw slowly live, digital road, the large visual field, news headlines, vision alliance, news flash, news review"; they are filled into the navigation topic slot. The resulting summary of the Sohu homepage is "Sohu News, news center, a professional current-affairs reporting portal covering interview, blog, comment, forum, click today, slowly draw slowly live, digital road, the large visual field, news headlines, vision alliance, news flash, news review".
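Slot filling itself is plain string substitution. The sketch below assumes slots are written as a name in square brackets and that list-valued content is joined with commas; the function name and the shortened keyword list are illustrative.

```python
def fill_template(template, slots):
    """Replace each "[name]" marker in `template` with the matching slot value."""
    summary = template
    for name, value in slots.items():
        text = ", ".join(value) if isinstance(value, (list, tuple)) else value
        summary = summary.replace("[" + name + "]", text)
    return summary


summary = fill_template(
    "[website name] News, news center, a professional current-affairs "
    "reporting portal covering [navigation topic]",
    {"website name": "Sohu", "navigation topic": ["interview", "blog", "comment", "forum"]},
)
```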
Please refer to Fig. 8, which is a structural block diagram of the device for automatically generating a homepage summary according to the present invention. As shown in Fig. 8, the device comprises: a judging unit 201, a homepage determining unit 202, a template generating unit 203, and a keyword extracting unit 204.
The judging unit 201 is configured to judge whether the to-be-processed homepage has description information; if so, the description information is used as the summary of the to-be-processed homepage; otherwise, the homepage determining unit 202 is triggered.
Specifically, the website category table used by the website determining subunit 2021 can be obtained by extracting categorized websites from navigation classification information on the Internet, or by classifying the websites corresponding to the clicked pages recorded in the search log, where the classification strategy is to treat the websites corresponding to the different pages clicked for the same query as one category.
Specifically, the selecting subunit 2022 obtains the homepages corresponding to the candidate websites as follows:
Query a preset mapping table between websites and homepages to obtain the homepage corresponding to each candidate website; or, for each candidate website, use the name of the candidate website as a query to obtain the search results returned by a search engine, and extract from the search results the page that satisfies the homepage features as the homepage corresponding to the candidate website. The homepage features comprise: the URL of the page contains only a domain name, and the page contains verification information corresponding to the name of the candidate website, the verification information comprising text or a graphic.
Specifically, the template generating unit 203 extracts the summary template as follows:
Compare the description information of the obtained homepages, and abstract the parts that occupy the same position but differ in content into template slots, thereby obtaining the summary template.
Specifically, the keyword extracting unit 204 can extract the anchor text words in the to-be-processed homepage as keywords.
In a preferred embodiment, the slots of the summary template obtained by the template generating unit 203 comprise a website name slot and a navigation topic slot. The keyword extracting unit 204 extracts keywords as follows: extract the website name of the to-be-processed homepage and fill it into the website name slot of the summary template, and extract the anchor text words of the to-be-processed homepage that have navigation features and fill them into the navigation topic slot of the summary template.
Here, the navigation features comprise: in the source file of the web page, the anchor text words are located in a contiguous DIV block, the link address that each anchor text word points to contains the same main domain as the link address of the to-be-processed homepage, and the average length of the anchor text words falls within a set range.
It should be understood that the device structure shown in Fig. 8 is a preferred implementation; the judging unit 201 is not an essential technical feature of the present invention and does not limit the device of the present invention.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (18)
1. one kind generates the method that homepage is made a summary automatically, comprising:
A, determine that a plurality of and pending homepage belongs to same classification website and has the homepage of descriptor;
The descriptor of a plurality of homepages that B, utilization are determined extracts the summary template;
C, from described pending homepage, extract keyword and be filled to corresponding groove position in the described summary template, obtain the summary of described pending homepage.
2. method according to claim 1 is characterized in that, also comprises before described steps A:
Judge whether described pending homepage exists descriptor, if so, then directly with the summary of described descriptor as described pending homepage; Otherwise, carry out described steps A.
3. method according to claim 1 is characterized in that, described steps A specifically comprises:
A1, the default categories of websites table of basis are determined to belong to other candidate website of same class with described pending homepage;
A2, from homepage corresponding to described candidate website, obtain a plurality of homepages with descriptor.
4. method according to claim 3 is characterized in that, described categories of websites table is to obtain by extract classified website from the navigation classified information of internet after; Perhaps, be by respectively clicking of log recording of search obtained after classifying in the corresponding website of the page, wherein the strategy that adopts of classification is that the difference that same queries causes is clicked the corresponding website of the page as a class.
5. method according to claim 3 is characterized in that, determines in the described steps A 2 that the step of the homepage that described candidate website is corresponding specifically comprises:
The website that inquiry is default and the mapping table between the homepage are to obtain the respectively homepage of correspondence of each candidate website; Perhaps, for each candidate website, the name of this candidate website is referred to as the result for retrieval that searching keyword returns to obtain search engine, and from result for retrieval, extracts and satisfy the page of homepage feature as homepage corresponding to this candidate website.
6. method according to claim 5 is characterized in that, described homepage feature specifically comprises:
Only comprise domain name among the URL of the page, and the page comprises the authorization information corresponding with the candidate website title, described authorization information comprises literal or diagram.
7. method according to claim 1 is characterized in that, described step B specifically comprises:
Compare the descriptor of described a plurality of homepages, with the identical and parts that content is different of correspondence position in the descriptor of described a plurality of homepages abstract be template groove position, obtain the template of making a summary.
8. The method according to claim 1, characterized in that, in step C, anchor text words are extracted from the homepage to be processed as the keywords.
9. The method according to claim 1, characterized in that the slots of the summary template comprise: a website name and a navigation topic;
step C specifically comprises: extracting the website name of the homepage to be processed and filling it into the website name slot of the summary template, and extracting anchor text words having navigation features from the homepage to be processed and filling them into the navigation topic slot of the summary template.
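Step C as refined in claims 8 and 9 might look like the following sketch, which collects anchor text with Python's standard `html.parser` and fills a two-slot template. The template string, the slot names `site_name` and `topics`, and the rule of taking the first few anchors as navigation topics are all assumptions for illustration; the claims do not define which anchors count as having navigation features.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect the visible text of every <a> element on a page."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # greater than 0 while inside an <a> element
        self._buffer = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._depth += 1

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._buffer).strip()
                if text:
                    self.anchors.append(text)
                self._buffer = []


def fill_template(template, site_name, html, max_topics=3):
    """Fill the website name slot and the navigation topic slot."""
    parser = AnchorTextCollector()
    parser.feed(html)
    parser.close()
    # Assumption: the first few anchors on a homepage are navigation links.
    topics = parser.anchors[:max_topics]
    return template.format(site_name=site_name, topics=", ".join(topics))


page = '<body><a href="/news">News</a> <a href="/sports">Sports</a><p>text</p></body>'
summary = fill_template("{site_name} offers {topics}.", "Example Portal", page)
```

Ordinary paragraph text is ignored because only data seen inside an `<a>` element is buffered.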
10. A device for automatically generating a homepage summary, comprising:
a homepage determining unit, configured to determine a plurality of homepages that belong to websites of the same category as a homepage to be processed and that have description information;
a template generation unit, configured to extract a summary template from the description information of the determined plurality of homepages;
a keyword extraction unit, configured to extract keywords from the homepage to be processed and fill them into the corresponding slots of the summary template, to obtain the summary of the homepage to be processed.
11. The device according to claim 10, characterized in that the device further comprises a judging unit connected to the homepage determining unit and configured to judge whether the homepage to be processed has description information; if so, the description information is used directly as the summary of the homepage to be processed; otherwise, the homepage determining unit is triggered to execute.
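The branching performed by the judging unit of claim 11 can be sketched as below, assuming that the description information is the `content` of a `<meta name="description">` tag in the page source. The class and function names are hypothetical, and `fallback` stands in for the template-based generation path carried out by the other units.

```python
from html.parser import HTMLParser


class MetaDescriptionFinder(HTMLParser):
    """Record the first non-empty <meta name="description"> content."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.description is not None:
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "description":
            content = (attr.get("content") or "").strip()
            if content:
                self.description = content


def summarize(html, fallback):
    """Use existing description information if present; otherwise call fallback."""
    finder = MetaDescriptionFinder()
    finder.feed(html)
    finder.close()
    if finder.description is not None:
        return finder.description
    return fallback(html)
```

A page whose head contains `<meta name="description" content="A bookstore.">` is summarized directly as "A bookstore.", while a page without such a tag is passed to `fallback`.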
12. The device according to claim 10, characterized in that the homepage determining unit specifically comprises:
a website determining subunit, configured to determine, according to a preset website category table, candidate websites that belong to the same category as the homepage to be processed;
a selecting subunit, configured to obtain, from the homepages corresponding to the candidate websites, a plurality of homepages that have description information.
13. The device according to claim 12, characterized in that the website category table is obtained by extracting categorized websites from navigation classification information on the Internet; or by classifying the websites corresponding to the clicked pages recorded in a search log, wherein the classification strategy is to treat the websites corresponding to the different pages clicked for the same query as one category.
14. The device according to claim 12, characterized in that the manner in which the selecting subunit determines the homepages corresponding to the candidate websites specifically comprises:
querying a preset mapping table between websites and homepages to obtain the homepage corresponding to each candidate website; or, for each candidate website, submitting the name of the candidate website to a search engine as a query to obtain the returned search results, and extracting from the search results a page that satisfies the homepage features as the homepage corresponding to that candidate website.
15. The device according to claim 14, characterized in that the homepage features specifically comprise:
the URL of the page contains only a domain name, and the page contains verification information corresponding to the name of the candidate website, the verification information comprising text or an image.
16. The device according to claim 10, characterized in that the manner in which the template generation unit extracts the summary template specifically comprises:
comparing the description information of the plurality of homepages, and abstracting the parts that occupy the same position in the description information of the plurality of homepages but differ in content into template slots, to obtain the summary template.
17. The device according to claim 10, characterized in that the keyword extraction unit extracts anchor text words from the homepage to be processed as the keywords.
18. The device according to claim 10, characterized in that the slots of the summary template comprise: a website name and a navigation topic;
the manner in which the keyword extraction unit extracts keywords specifically comprises: extracting the website name of the homepage to be processed and filling it into the website name slot of the summary template, and extracting anchor text words having navigation features from the homepage to be processed and filling them into the navigation topic slot of the summary template.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012100754141A CN103324622A (en) | 2012-03-21 | 2012-03-21 | Method and device for automatic generating of front page abstract |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012100754141A CN103324622A (en) | 2012-03-21 | 2012-03-21 | Method and device for automatic generating of front page abstract |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103324622A (en) | 2013-09-25 |
Family ID: 49193370
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2012100754141A Pending CN103324622A (en) | 2012-03-21 | 2012-03-21 | Method and device for automatic generating of front page abstract |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103324622A (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20010000537A1 (en) * | 1998-12-08 | 2001-04-26 | Inala Suman Kumar | Method and apparatus for obtaining and presenting WEB summaries to users |
CN1335572A (en) * | 2000-07-25 | 2002-02-13 | 金网专线通有限公司 | Searching system and method for first page of searching web |
CN1435775A (en) * | 2002-01-31 | 2003-08-13 | 百度在线网络技术(北京)有限公司 | Method for identifying mirror and quasi-mirror web sites over internet |
CN1444160A (en) * | 2003-02-17 | 2003-09-24 | 刘莎 | Integrated information structured abstract service system and service method |
US20050246410A1 (en) * | 2004-04-30 | 2005-11-03 | Microsoft Corporation | Method and system for classifying display pages using summaries |
CN101458713A (en) * | 2008-12-29 | 2009-06-17 | 北京搜狗科技发展有限公司 | Website classifying method and system |
WO2009120426A2 (en) * | 2008-03-28 | 2009-10-01 | Microsoft Corporation | Automatic customization and rendering of ads based on detected features in a web page |
CN101667194A (en) * | 2009-09-29 | 2010-03-10 | 北京大学 | Automatic abstracting method and system based on user comment text feature |
US20110087671A1 (en) * | 2009-10-14 | 2011-04-14 | National Chiao Tung University | Document Processing System and Method Thereof |
Non-Patent Citations (2)
Title |
---|
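崔灵珍 (Cui Lingzhen): "《Web文本摘要技术的研究与应用》" (Research and Application of Web Text Summarization Technology), 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology) * |
沈焕生等 (Shen Huansheng et al.): "《基于信息抽取的自动摘要生成技术》" (Automatic Summary Generation Technology Based on Information Extraction), 《2009年中国信息技术应用学术研讨会论文集》 (Proceedings of the 2009 China Symposium on Information Technology Applications) * |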
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105786834A (en) * | 2014-12-22 | 2016-07-20 | 北京奇虎科技有限公司 | Method and system for generating structured abstract of social webpage |
CN105786849A (en) * | 2014-12-22 | 2016-07-20 | 北京奇虎科技有限公司 | Method and system for generating document web page custom abstract |
CN105786837A (en) * | 2014-12-22 | 2016-07-20 | 北京奇虎科技有限公司 | Method and system for generating intelligent abstract of novel webpage |
CN105786841A (en) * | 2014-12-22 | 2016-07-20 | 北京奇虎科技有限公司 | Method and system for generating smart abstract of news webpage |
CN105786835A (en) * | 2014-12-22 | 2016-07-20 | 北京奇虎科技有限公司 | Method and system for displaying user-defined abstract of picture webpage in search result |
CN105786836A (en) * | 2014-12-22 | 2016-07-20 | 北京奇虎科技有限公司 | Method and system for generating structured abstract of video webpage |
CN105786853A (en) * | 2014-12-22 | 2016-07-20 | 北京奇虎科技有限公司 | Display method and system for smart abstract of forum post |
CN106407344B (en) * | 2016-09-06 | 2019-11-15 | 努比亚技术有限公司 | A kind of method and system generating search engine optimization label |
CN106407344A (en) * | 2016-09-06 | 2017-02-15 | 努比亚技术有限公司 | Method and system for generating search engine optimization label |
CN110059309A (en) * | 2018-01-18 | 2019-07-26 | 北京京东尚科信息技术有限公司 | The generation method and device of information object title |
CN110555199B (en) * | 2018-06-01 | 2023-07-04 | 北京百度网讯科技有限公司 | Article generation method, device, equipment and storage medium based on hotspot materials |
CN110555199A (en) * | 2018-06-01 | 2019-12-10 | 北京百度网讯科技有限公司 | article generation method, device and equipment based on hotspot materials and storage medium |
CN109189530A (en) * | 2018-08-27 | 2019-01-11 | 李东正 | A kind of data processing method and medical care table based on medical care table |
CN109684473A (en) * | 2018-12-28 | 2019-04-26 | 丹翰智能科技(上海)有限公司 | A kind of automatic bulletin generation method and system |
CN110059163A (en) * | 2019-04-29 | 2019-07-26 | 百度在线网络技术(北京)有限公司 | Generate method and apparatus, the electronic equipment, computer-readable medium of template |
CN110264315A (en) * | 2019-06-20 | 2019-09-20 | 北京百度网讯科技有限公司 | Recommended information generation method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103324622A (en) | Method and device for automatic generating of front page abstract | |
Lim et al. | Multiple sets of features for automatic genre classification of web documents | |
Madhavan et al. | Harnessing the deep web: Present and future | |
Jäschke et al. | Tag recommendations in folksonomies | |
US9069857B2 (en) | Per-document index for semantic searching | |
Balakrishnan et al. | Applying webtables in practice | |
Beinglass et al. | Articulated object recognition, or: How to generalize the generalized hough transform | |
US20140115439A1 (en) | Methods and systems for annotating web pages and managing annotations and annotated web pages | |
CN103544255A (en) | Text semantic relativity based network public opinion information analysis method | |
CN102567494B (en) | Website classification method and device | |
CN103678564A (en) | Internet product research system based on data mining | |
CN102456054B (en) | A kind of searching method and system | |
EP2425353A1 (en) | Method and apparatus for identifying synonyms and using synonyms to search | |
CN103617174A (en) | Distributed searching method based on cloud computing | |
CN104715064A (en) | Method and server for marking keywords on webpage | |
CN101639857A (en) | Method, device and system for establishing knowledge questioning and answering sharing platform | |
CN104123366A (en) | Search method and server | |
US9280522B2 (en) | Highlighting of document elements | |
CN104598577A (en) | Extraction method for webpage text | |
CN103838798A (en) | Page classification system and method | |
CN104915422A (en) | Webpage collecting method and device based on browser | |
CN103116635A (en) | Field-oriented method and system for collecting invisible web resources | |
KR102107474B1 (en) | Social issue deduction system and method using crawling | |
Jalal | Text Mining: Design of Interactive Search Engine Based Regular Expressions of Online Automobile Advertisements. | |
CN104778232A (en) | Searching result optimizing method and device based on long query |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20130925 |