WO2015196910A1 - Search engine-based summary information extraction method, apparatus and search engine - Google Patents

Search engine-based summary information extraction method, apparatus and search engine

Info

Publication number
WO2015196910A1
WO2015196910A1 PCT/CN2015/080676 CN2015080676W WO2015196910A1 WO 2015196910 A1 WO2015196910 A1 WO 2015196910A1 CN 2015080676 W CN2015080676 W CN 2015080676W WO 2015196910 A1 WO2015196910 A1 WO 2015196910A1
Authority
WO
WIPO (PCT)
Prior art keywords
page
information
summary information
webpage resource
webpage
Prior art date
Application number
PCT/CN2015/080676
Other languages
French (fr)
Chinese (zh)
Inventor
董毅
张前川
陈营营
张川
Original Assignee
北京奇虎科技有限公司
奇智软件(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京奇虎科技有限公司, 奇智软件(北京)有限公司
Publication of WO2015196910A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Definitions

  • the present invention relates to the technical field of information retrieval, and in particular to a search engine-based summary information extraction method, an apparatus for search engine-based summary information extraction, and a search engine.
  • in addition to the page title and URL, the search results provided by a search engine may also include a summary taken from the web page.
  • the ways in which search engines currently generate summaries fall into two categories, static and dynamic:
  • the digest thus formed is stored in the query subsystem, and once the relevant document is selected to match the query item, it is read back to the user.
  • this approach is the easiest for the query subsystem and does not require additional processing. But one of the biggest drawbacks of this approach is that the summary is not related to the query.
  • the dynamic summary method comes into being.
  • the dynamic summary is to extract the surrounding text according to the position of the query word in the document when responding to the query, and highlight the query word when displaying. This is the way most search engines currently use.
  • although the content of the dynamic summary contains the user's query terms, these sentences do not express the central meaning of the entire Web document.
  • by reading the summary returned by the search engine, the user cannot tell whether the information he or she is looking for is included in the page.
  • the user then has to click the search result and check whether the corresponding webpage contains the desired information; this repeated interaction consumes bandwidth and makes searching inefficient.
  • in view of the above problems, the present invention is proposed to provide a search engine-based summary information extraction method, a corresponding search engine-based summary information extraction apparatus, and a search engine that overcome the above problems or at least partially solve or alleviate them.
  • a search engine-based summary information extraction method including:
  • the summary information is output.
  • a search engine-based summary information extracting apparatus including:
  • a webpage resource obtaining module configured to obtain a matching webpage resource based on a search string received in a search engine
  • a page type identification module configured to identify a page type of the webpage resource
  • a summary information extraction module configured to extract corresponding summary information from the webpage resource for the page type
  • An information output module adapted to output the summary information.
  • a search engine comprising:
  • a webpage resource obtaining module configured to obtain a matching webpage resource based on the received search string
  • a page type identification module configured to identify a page type of the webpage resource
  • a summary information extraction module configured to extract corresponding summary information from the webpage resource for the page type
  • An information output module adapted to output the summary information.
  • a computer program comprising computer readable code that, when executed on a computing device, causes the computing device to perform the search engine-based summary information extraction method described above.
  • a computer readable medium is provided, in which the above computer program is stored.
  • after receiving the search string input by the user, the search engine searches for all webpage resources containing the search string as matching webpage resources, and the summary information output in the search results is extracted from each webpage resource according to its identified page type. As a result, the summary information displayed in the search results expresses the central meaning of the whole page more accurately and is more valuable to the user, who can obtain the desired information from the summary itself. This reduces how often users must click through to the page behind a search result to find what they need, which improves retrieval speed, reduces the number of interactions with the search engine, and increases the data processing rate.
  • the corresponding cookie information is obtained according to the webpage resource, the user's historical access record is obtained according to the cookie information, and element information of the webpage resource whose access count is greater than a first threshold is taken from the historical access record as the summary information. The summary information displayed in the search results is therefore personalized for each user, which improves the user experience and makes the summary more valuable: the user can obtain the desired information from the summary itself. This reduces how often users must click through to the page behind a search result to find what they need, which improves retrieval speed, reduces the number of interactions with the search engine, and increases the data processing rate.
  • FIG. 1 schematically shows a flow chart of the steps of Embodiment 1 of a search engine-based summary information extraction method according to an embodiment of the present invention
  • FIG. 2 schematically shows a flow chart of the steps of Embodiment 2 of a search engine-based summary information extraction method according to an embodiment of the present invention
  • FIG. 2-a is a schematic diagram of a download body page in Embodiment 2 of a search engine-based summary information extraction method according to an embodiment of the present invention
  • FIG. 2-b is a schematic diagram of a first output result in Embodiment 2 of a search engine-based summary information extraction method according to an embodiment of the present invention
  • FIG. 3 schematically shows a flow chart of the steps of Embodiment 3 of a search engine-based summary information extraction method according to an embodiment of the present invention
  • FIG. 3-a is a schematic diagram of the home page of a video website in Embodiment 3 of a search engine-based summary information extraction method according to an embodiment of the present invention
  • FIG. 3-b is a schematic diagram of a second output result in Embodiment 3 of a search engine-based summary information extraction method according to an embodiment of the present invention
  • FIG. 4 schematically shows a flow chart of the steps of Embodiment 4 of a search engine-based summary information extraction method according to an embodiment of the present invention
  • FIG. 4-a is a schematic diagram of the home page of a video website in Embodiment 4 of a search engine-based summary information extraction method according to an embodiment of the present invention
  • FIG. 4-b is a schematic diagram of a third output result in Embodiment 4 of a search engine-based summary information extraction method according to an embodiment of the present invention
  • FIG. 5 schematically shows a flow chart of the steps of Embodiment 5 of a search engine-based summary information extraction method according to an embodiment of the present invention
  • FIG. 6 is a structural block diagram of an embodiment of a search engine-based summary information extraction apparatus according to an embodiment of the present invention
  • FIG. 7 is a structural block diagram of an embodiment of a search engine according to an embodiment of the present invention
  • FIG. 8 schematically shows a block diagram of a computing device for performing the method according to the invention
  • FIG. 9 schematically shows a storage unit for holding or carrying program code that implements the method according to the invention.
  • referring to FIG. 1, a flow chart of the steps of Embodiment 1 of a search engine-based summary information extraction method according to an embodiment of the present invention is shown.
  • the embodiment of the present invention may include the following steps:
  • Step 101 Obtain a matching webpage resource based on a search string received in a search engine.
  • Step 102 Identify a page type of the webpage resource.
  • Step 103 Extract corresponding summary information from the webpage resource for the page type.
  • Step 104 Output the summary information.
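Steps 101 to 104 can be read as a four-stage pipeline. The following is a minimal sketch of that flow, not the patented implementation: the names `WebpageResource`, `obtain_resources`, `identify_page_type`, `EXTRACTORS`, and `summarize` are hypothetical, the page type test is a deliberately crude stand-in, and the two extractors are placeholders for the type-specific extraction described in the later embodiments.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class WebpageResource:
    url: str
    source: str  # raw HTML source of the page


def obtain_resources(query: str, index: List[WebpageResource]) -> List[WebpageResource]:
    """Step 101: keep every indexed resource whose source contains the search string."""
    return [r for r in index if query.lower() in r.source.lower()]


def identify_page_type(resource: WebpageResource) -> str:
    """Step 102: crude stand-in classifier (assumption: many <li> items means a list page)."""
    return "list" if resource.source.count("<li") > 3 else "single"


def extract_single(resource: WebpageResource) -> str:
    """Placeholder extractor for single pages: first five words of the visible text."""
    words = re.sub(r"<[^>]+>", " ", resource.source).split()
    return " ".join(words[:5])


def extract_list(resource: WebpageResource) -> str:
    """Placeholder extractor for list pages: text of the first two list items."""
    items = re.findall(r"<li[^>]*>(.*?)</li>", resource.source)
    return "; ".join(items[:2])


# Step 103 dispatches on the page type identified in step 102.
EXTRACTORS: Dict[str, Callable[[WebpageResource], str]] = {
    "single": extract_single,
    "list": extract_list,
}


def summarize(query: str, index: List[WebpageResource]) -> List[Tuple[str, str]]:
    """Run steps 101 to 104 and return (url, summary) pairs."""
    results = []
    for resource in obtain_resources(query, index):   # Step 101
        page_type = identify_page_type(resource)      # Step 102
        summary = EXTRACTORS[page_type](resource)     # Step 103
        results.append((resource.url, summary))       # Step 104
    return results
```

The dispatch table keeps the extraction logic per page type separate, which mirrors the way the later embodiments treat single pages and list pages differently.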
  • after receiving the search string input by the user, the search engine searches for all webpage resources containing the search string as matching webpage resources, and the summary information output in the search results is extracted from each webpage resource according to its identified page type. As a result, the summary information displayed in the search results expresses the central meaning of the whole page more accurately and is more valuable to the user, who can obtain the desired information from the summary itself. This reduces how often users must click through to the page behind a search result to find what they need, which improves retrieval speed, reduces the number of interactions with the search engine, and increases the data processing rate.
  • referring to FIG. 2, a flow chart of the steps of Embodiment 2 of a search engine-based summary information extraction method according to an embodiment of the present invention is shown.
  • the embodiment of the present invention may include the following steps:
  • Step 201 Acquire a matching webpage resource based on a search string received in a search engine, where the webpage resource includes a webpage source code;
  • the search string query is the search information entered by the user in the search engine interface to express the user's intention and request to search for web resources related thereto.
  • the webpage resource may include information such as a webpage text, a webpage URL address, a webpage source code constituting the webpage, and a link to and from the webpage.
  • Step 202 Identify a page type of the webpage resource, where the page type includes a single page;
  • the step 202 may include the following substeps:
  • Sub-step S11 extracting a page frame of the webpage resource, and calculating a page frame ID
  • the page frame of the webpage resource may be extracted according to the html tags in the webpage source code: only the frame-type tags, such as frame and table, are retained, and for those tags the id, name, and class attributes are kept while the remaining attributes are removed. Alternatively, the body text of the page can be identified by its punctuation and removed to obtain the page frame.
  • the page frame ID can then be calculated with a hash algorithm: the frame-type tags, such as frame and table, together with their id, name, and class attributes are passed through a hash function, and the resulting hash value is the page frame ID. Because the same hash function is used, identical page frames yield identical page frame IDs.
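Sub-step S11 can be sketched as follows. This is an illustration under stated assumptions rather than the method as filed: the set `FRAME_TAGS` is an assumed list of frame-type tags (the text names only frame and table as examples), SHA-1 is an arbitrary choice because the text does not name a hash function, and the helper names `FrameExtractor`, `page_frame`, and `page_frame_id` are hypothetical.

```python
import hashlib
from html.parser import HTMLParser

# Assumed set of frame-type tags; the source only gives "frame" and "table" as examples.
FRAME_TAGS = {"frame", "frameset", "iframe", "table", "tr", "td", "div"}
KEPT_ATTRS = ("id", "name", "class")


class FrameExtractor(HTMLParser):
    """Collects frame-type tags, keeping only their id, name, and class attributes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in FRAME_TAGS:
            attr_map = dict(attrs)
            kept = [f'{k}="{attr_map[k]}"' for k in KEPT_ATTRS
                    if attr_map.get(k) is not None]
            self.parts.append("<" + " ".join([tag] + kept) + ">")

    def handle_endtag(self, tag):
        if tag in FRAME_TAGS:
            self.parts.append(f"</{tag}>")


def page_frame(source: str) -> str:
    """Return the page frame: frame-type tags only, with body text and other attributes dropped."""
    parser = FrameExtractor()
    parser.feed(source)
    parser.close()
    return "".join(parser.parts)


def page_frame_id(source: str) -> str:
    """Hash the page frame; identical frames give identical IDs."""
    return hashlib.sha1(page_frame(source).encode("utf-8")).hexdigest()
```

Two download pages generated from the same template differ in body text and styling attributes but share a frame, so they receive the same page frame ID, which is what sub-step S12 counts.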
  • Sub-step S12 if the number of page frames of the same page frame ID is greater than a preset threshold, calculating a page frame mode
  • the calculation may use a machine learning mechanism, for example a support vector machine (SVM), to calculate the page frame mode.
  • the extracted page frames are input into the SVM for learning, that is, the key html tags are matched against the page frame. The key html tags in page frames with the same ID match completely, so for page frames with the same ID the SVM outputs the page frame mode of the corresponding page frame.
  • Sub-step S13 matching the page frame mode with a page frame mode in a pre-generated database to identify the page type.
  • the pre-generated database stores page frame modes of known types together with the weight of each webpage feature in each mode. Weights are accumulated for the page frame under each category, and the page is assigned the page type whose accumulated weight is highest.
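The weighted matching in sub-step S13 might look like the sketch below. The text does not specify how features are represented or how weights are combined, so everything here is an assumption for illustration: features are plain strings, the score of a page type is the sum of the weights of the features present, and `PATTERN_DB`, its entries, and `match_page_type` are hypothetical.

```python
from typing import Dict, Optional, Set

# Hypothetical pre-generated database: page type -> {webpage feature: weight}.
PATTERN_DB: Dict[str, Dict[str, float]] = {
    "download page": {"table.dl": 2.0, "div#download": 3.0},
    "video list page": {"div#rank": 3.0, "table.list": 1.5},
}


def match_page_type(features: Set[str],
                    db: Dict[str, Dict[str, float]] = PATTERN_DB) -> Optional[str]:
    """Sum the weights of the features present for each type and return the best type."""
    scores = {
        page_type: sum(w for feature, w in weights.items() if feature in features)
        for page_type, weights in db.items()
    }
    best = max(scores, key=scores.get)
    # No feature matched any known mode: report the type as unknown.
    return best if scores[best] > 0 else None
```

Returning `None` when nothing matches is a design choice of this sketch; the text does not say what happens to pages that match no known mode.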
  • the page type in the embodiment of the present invention may include a single page, and/or a list page.
  • the single page is a page with a single page element, and may include one or a combination of the following: a download body page, an audio and video play page, a novel reading page, a question and answer page, a news group map page, and a special page.
  • the list page may include an audio and video list page.
  • Step 203 Extract one or more key element information from the webpage source code as summary information for the single page;
  • the summary information may include at least one or a combination of the following: an element URL of the one or more element information, an element identifier, an element picture, and an element text description information.
  • one or more key element information may be extracted according to the content of the html tags in the webpage source code. The html tags may include <a> tags (which define hyperlinks; the href attribute indicates the target of the link), <meta> tags (which provide meta-information about the page, such as descriptions and keywords for search engines, and the update frequency), <span> tags (which group inline elements), <div> tags, <p> tags, <script> tags, <classs> tags, and more.
  • the corresponding element information can be obtained as summary information from the following code:
  • XX is the corresponding download object
  • the corresponding element information or summary information is: 56.6M
  • download address is: http://dldir1.XX.com/XXfile/XX/XX2013/ XX2013SP6/9305/XX2013SP6.exe.
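The extraction in step 203 can be illustrated for the download page case. This is a minimal sketch with assumptions stated: the page markup in the usage example is invented, the file suffixes used to recognize a download link are a guess, and `LinkExtractor` and `download_summary` are hypothetical names. It collects the href and link text of every <a> tag and keeps those that look like download addresses.

```python
from html.parser import HTMLParser
from typing import Dict, List, Tuple


class LinkExtractor(HTMLParser):
    """Collects the href and visible text of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links: List[Dict[str, str]] = []
        self._current = None  # link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current = {"url": href, "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None


def download_summary(source: str,
                     suffixes: Tuple[str, ...] = (".exe", ".zip")) -> List[Dict[str, str]]:
    """Return links whose URL ends in an assumed download suffix, as summary elements."""
    parser = LinkExtractor()
    parser.feed(source)
    parser.close()
    return [link for link in parser.links if link["url"].lower().endswith(suffixes)]
```

For a page fragment such as `<a href="http://dl.example.com/XX2013SP6.exe">Download XX</a>` followed by an ordinary help link, only the first link is returned, giving the element URL and element text that the summary needs.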
  • Step 204 Output the summary information.
  • the summary information may be output in the preset position of the corresponding search result when the search result is output.
  • the download body page 200 has information such as a download object identifier 210, a download object description 220, a download address 1 230, and a download address 2 240.
  • the download object identifier may be XX software, the official version, and so on; the download object description may include the software size, update time, software language, provider, software license, software rating, application platform, an introduction to the software's functions, and other information.
  • the user's main requirement is the download address, so the download address links in the page can be extracted in step 203 and displayed as the summary information of the search result. The user can then obtain the download address directly from the summary information and download the object without entering the page to look for the address. The output summary information is shown in the first output result diagram of FIG. 2-b.
  • after receiving the search string input by the user, the search engine searches for all webpage resources containing the search string as matching webpage resources and, after identifying the page type of each webpage resource, extracts the corresponding summary information from the source code of webpage resources that are single pages. As a result, the summary information displayed in the search results expresses the central meaning of the whole page more accurately and is more valuable to the user, who can obtain the desired information from the summary itself. This reduces how often users must click through to the page behind a search result to find what they need, which improves retrieval speed, reduces the number of interactions with the search engine, and increases the data processing rate.
  • referring to FIG. 3, a flow chart of the steps of Embodiment 3 of a search engine-based summary information extraction method according to an embodiment of the present invention is shown.
  • the embodiment of the present invention may include the following steps:
  • Step 301 Acquire a matching webpage resource based on a search string received in a search engine, where the webpage resource includes a webpage source code;
  • Step 302 Identify a page type of the webpage resource, where the page type includes a list page;
  • the step 302 may comprise the following sub-steps:
  • Sub-step S21 extracting a page frame of the webpage resource, and calculating a page frame ID
  • Sub-step S22 if the number of page frames of the same page frame ID is greater than a preset threshold, calculating a page frame mode
  • Sub-step S23 matching the page frame mode with the page frame mode in the pre-generated database to identify the page type.
  • the page type in the embodiment of the present invention may include a single page, and/or a list page.
  • the list page is a page with more page elements, and may include a list page such as an audio and video home page.
  • Step 303 For the list page, extract from the webpage source code one or more element information ranked highest by click rate in the webpage resource as the summary information.
  • the summary information may include at least one or a combination of the following: an element URL of the one or more element information, an element identifier, an element picture, and an element text description information.
  • the click rate data counted by the webpage (such as a video leaderboard) may be obtained according to the content of the html tags in the webpage source code, and one or more top-ranked element information may then be extracted from the click rate data as the summary information. The html tags may include <a> tags (which define hyperlinks; the href attribute indicates the target of the link), <meta> tags (which provide meta-information about the page, such as descriptions and keywords for search engines, and the update frequency), <span> tags (which group inline elements), <div> tags, <p> tags, <script> tags, <classs> tags, and more.
  • the corresponding element information can be obtained from the following code as summary information:
  • each element information may include at least one or more of the following attributes: element URL, element identification, element picture, element text description information. Therefore, for the above example, the playback URL, name, picture and other information of the sharp XX DVD version can be given in the summary information.
  • Step 304 Output the summary information.
  • the one or more element information may be displayed in the search result in the form of a carousel.
  • the home page of the video website shown in FIG. 3-a may include a video category list 310, a video of each video category, and a corresponding leaderboard (such as category 1 leaderboard 320).
  • the video category list may include TV dramas, movies, variety shows, music, animation, travel, and so on. For example, category 1 330 is TV series, video A to video F are various TV drama programs, and the category 1 leaderboard lists, in order, video A, video B, video D, video F, and so on.
  • the video programs of the video website 300 that rank at the top of the leaderboard are used (for example the top two; the specific number can be set as needed and is not limited by the embodiment of the present invention).
  • the video A, the video B, and the like displayed in the summary information may include a name, a play URL, a picture, and/or a text description of the corresponding video.
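The selection in step 303 reduces to sorting elements by click count and keeping the first few. The sketch below assumes the leaderboard has already been parsed into records; the `Element` type, its fields, the click numbers, and `top_elements` are hypothetical, and the default of two follows the "first two" example in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    name: str    # element identifier shown in the summary
    url: str     # play URL
    clicks: int  # click count taken from the page's own leaderboard data


def top_elements(elements: List[Element], n: int = 2) -> List[Element]:
    """Return the n elements with the highest click count, highest first."""
    return sorted(elements, key=lambda e: e.clicks, reverse=True)[:n]


# Usage with the leaderboard order from the example (video A, B, D, F);
# the click numbers and URLs are invented for illustration.
leaderboard = [
    Element("video D", "http://video.example/d", 500),
    Element("video A", "http://video.example/a", 900),
    Element("video F", "http://video.example/f", 300),
    Element("video B", "http://video.example/b", 700),
]
summary = top_elements(leaderboard)
```

Here `summary` holds video A and video B, each carrying the name and play URL that the summary information would display.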
  • after receiving the search string input by the user, the search engine searches for all webpage resources containing the search string as matching webpage resources and, after identifying the page type of each webpage resource, extracts the corresponding summary information from the source code of webpage resources that are list pages. As a result, the summary information displayed in the search results expresses the central meaning of the whole page more accurately and is more valuable to the user, who can obtain the desired information from the summary itself. This reduces how often users must click through to the page behind a search result to find what they need, which improves retrieval speed, reduces the number of interactions with the search engine, and increases the data processing rate.
  • referring to FIG. 4, a flow chart of the steps of Embodiment 4 of a search engine-based summary information extraction method according to an embodiment of the present invention is shown.
  • the embodiment of the present invention may include the following steps:
  • Step 401 Acquire a matching webpage resource based on a search string received in a search engine
  • Step 402 Identify a page type of the webpage resource.
  • the step 402 may comprise the following sub-steps:
  • Sub-step S32 if the number of page frames of the same page frame ID is greater than a preset threshold, calculating a page frame mode
  • Sub-step S33 the page frame mode is matched with the page frame mode in the pre-generated database to identify the page type.
  • after receiving the search string input by the user, the search engine searches for all webpage resources containing the search string as matching webpage resources. After identifying the page type of each webpage resource, for the different page types it obtains the corresponding cookie information according to the webpage resource, obtains the user's historical access record according to the cookie information, and takes from the historical access record the element information of the webpage resource whose access count is greater than the first threshold as the summary information. The summary information displayed in the search results is therefore personalized for each user and more valuable: the user can obtain the desired information from the summary itself. This reduces how often users must click through to the page behind a search result to find what they need, which improves retrieval speed, reduces the number of interactions with the search engine, and increases the data processing rate.
  • the element information of the webpage resource whose access count is greater than the first threshold is obtained from the historical access record as summary information.
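The threshold test on the historical access record can be sketched with a counter. The text does not describe the format of the access record or how it is recovered from the cookie, so this sketch assumes the record is already available as a flat list of accessed element names; `personalized_elements` is a hypothetical name.

```python
from collections import Counter
from typing import List


def personalized_elements(history: List[str], first_threshold: int) -> List[str]:
    """Return elements accessed more than first_threshold times, most accessed first."""
    counts = Counter(history)
    return [element for element, count in counts.most_common()
            if count > first_threshold]


# Usage: an invented access record for one user of a video site.
history = ["video A", "video B", "video A", "video C", "video A", "video B"]
frequent = personalized_elements(history, first_threshold=1)
```

With this record, video A (3 accesses) and video B (2 accesses) exceed the threshold of 1 and would be shown in that user's summary, while video C would not.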
  • a tag adding module is adapted to add a specific tag TAG to the digest information.
  • the search engine may include the following modules.
  • the webpage resource obtaining module 701 is adapted to obtain a matching webpage resource based on the received search string
  • the information output module 704 is adapted to output the summary information.
  • the page type identification module 702 is further adapted to:
  • the single page may include one or a combination of the following: a download body page, an audio and video play page, a novel reading page, a question and answer page, a news group map page, and a feature page.
  • the list page may include an audio and video list page.
  • the summary information extraction module 703 is further adapted to:
  • the element information of the webpage resource whose access count is greater than the first threshold is obtained from the historical access record as summary information.
  • the embodiment of the present invention may further include:
  • the summary information extraction module 703 is further adapted to:
  • the summary information may include at least one or a combination of the following: an element URL of the one or more element information, an element identifier, an element picture, and an element text description information.
  • the various component embodiments of the present invention may be implemented in hardware, or in a software module running on one or more processors, or in a combination thereof.
  • Those skilled in the art will appreciate that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the search engine-based summary information extraction device according to embodiments of the present invention.
  • the invention can also be implemented as a device or device program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein.
  • Such a program implementing the invention may be stored on a computer readable medium or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
  • Figure 8 illustrates a computing device, such as a retrieval server, that can implement search engine based summary information extraction in accordance with the present invention.
  • the computing device conventionally includes a processor 810 and a computer program product or computer readable medium in the form of a memory 820.
  • the memory 820 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), an EPROM, a hard disk, or a ROM.
  • Memory 820 has a memory space 830 for program code 831 for performing any of the method steps described above.
  • storage space 830 for program code may include various program code 831 for implementing various steps in the above methods, respectively.
  • the program code can be read from or written to one or more computer program products.
  • Such computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks.
  • Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG.
  • the storage unit may have storage segments, storage spaces, and the like that are similar to the storage 820 in the computing device of FIG.
  • the program code can be compressed, for example, in an appropriate form.
  • the storage unit includes computer readable code 831', that is, code readable by a processor such as 810, which, when executed by a computing device, causes the computing device to perform each step of the methods described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

A search engine-based summary information extraction method, apparatus and search engine. The method comprises: obtaining matched page resources on the basis of a search character string received in the search engine (101); identifying the page type of the page resources (102); in regard of the page type, extracting corresponding summary information from the page resources (103); and outputting the summary information (104). The situation where a user looks for desired information by frequently clicking pages corresponding to search results can be reduced, and therefore the retrieval speed is improved, the times of interaction of the search engine is reduced and the data processing speed is increased.

Description

Search engine-based summary information extraction method, apparatus and search engine
Technical field
The present invention relates to the technical field of information retrieval, and in particular to a search engine-based summary information extraction method, an apparatus for search engine-based summary information extraction, and a search engine.
Background
In today's era of abundant network information, search engines have become an indispensable tool for users searching massive resources.
To improve the presentation of search results, the results provided by a search engine may include, in addition to the page title and URL, a summary taken from the web page. The ways in which search engines currently generate summaries fall into the following two categories:
The first is the static approach: independently of the query and according to some rule, text is extracted from the webpage content in advance during preprocessing, for example by taking the first 512 bytes of the page body (corresponding to 256 Chinese characters) or by concatenating the first sentence of every paragraph. The summary formed in this way is stored in the query subsystem and, once the document is selected as matching a query, is read out and returned to the user. This approach is clearly the easiest for the query subsystem, since no additional processing is required, but its biggest drawback is that the summary is unrelated to the query.
用户希望摘要中能够突出显示和查询直接对应的文字,希望摘要中出现和他关心的文字相关的句子。因此,动态摘要方式应运而生,动态摘要即在响应查询的时候,根据查询词在文档中的位置,提取出周围的文字来,在显示时将查询词标亮。这是目前大多数搜索引擎采用的方式。The user wants to highlight and query the text directly corresponding to the abstract, and hope that the sentence related to the text he cares about appears in the abstract. Therefore, the dynamic summary method comes into being. The dynamic summary is to extract the surrounding text according to the position of the query word in the document when responding to the query, and highlight the query word when displaying. This is the way most search engines currently use.
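The two background approaches can be contrasted with a minimal Python sketch. The function names, the `<em>` highlight markup, and the `window` parameter are illustrative assumptions; the source describes the approaches only in prose and does not specify an implementation.

```python
import re

def static_summary(text: str, max_chars: int = 256) -> str:
    """Query-independent summary: the first max_chars characters of the body."""
    return text[:max_chars]

def dynamic_summary(text: str, query: str, window: int = 20) -> str:
    """Query-dependent summary: text around the first hit, with the hit marked.

    Falls back to a static summary when the query does not occur in the text.
    """
    match = re.search(re.escape(query), text, re.IGNORECASE)
    if match is None:
        return static_summary(text, 2 * window)
    start = max(0, match.start() - window)
    end = min(len(text), match.end() + window)
    # <em> is an assumed highlight marker, not something the source prescribes.
    return (text[start:match.start()]
            + "<em>" + match.group(0) + "</em>"
            + text[match.end():end])
```

The dynamic variant always contains the query term, but, as the following paragraph points out, the surrounding window need not reflect what the page as a whole is about.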
Although a dynamic summary contains the user's query terms, those sentences do not necessarily express the central meaning of the Web document as a whole. In other words, a user reading the summary returned by the search engine cannot tell whether the page contains the information being sought. The user then has to click the search result and inspect the corresponding webpage to check. These repeated interactions consume bandwidth and make searching inefficient.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a search engine-based summary information extraction method, a corresponding search engine-based summary information extraction apparatus, and a search engine that overcome the above problems or at least partially solve or mitigate them.
According to one aspect of the present invention, a search engine-based summary information extraction method is provided, including:
obtaining a matching webpage resource based on a search string received in a search engine;
identifying a page type of the webpage resource;
extracting, for the page type, corresponding summary information from the webpage resource;
outputting the summary information.
According to another aspect of the present invention, a search engine-based summary information extraction apparatus is provided, including:
a webpage resource obtaining module, adapted to obtain a matching webpage resource based on a search string received in a search engine;
a page type identification module, adapted to identify a page type of the webpage resource;
a summary information extraction module, adapted to extract, for the page type, corresponding summary information from the webpage resource;
an information output module, adapted to output the summary information.
According to another aspect of the present invention, a search engine is provided, including:
a webpage resource obtaining module, adapted to obtain a matching webpage resource based on a received search string;
a page type identification module, adapted to identify a page type of the webpage resource;
a summary information extraction module, adapted to extract, for the page type, corresponding summary information from the webpage resource;
an information output module, adapted to output the summary information.
According to yet another aspect of the present invention, a computer program is provided that includes computer-readable code which, when run on a computing device, causes the computing device to perform the search engine-based summary information extraction method described above.
According to a further aspect of the present invention, a computer-readable medium is provided in which the above computer program is stored.
The beneficial effects of the invention are as follows:
In embodiments of the present invention, after receiving the search string entered by the user, the search engine finds all webpage resources that contain the search string as the matching webpage resources. The summary information output in the search results is extracted from each webpage resource according to its identified page type. The summary information shown in the search results therefore expresses the central meaning of the whole page more accurately and is more valuable to the user, who can obtain the desired information from the summary itself. This reduces how often the user must click through to the pages behind search results to find the required information, which improves retrieval speed, reduces the number of interactions with the search engine, and increases the data processing rate.
In addition, in embodiments of the present invention, after the matching webpage resource is obtained, the corresponding cookie information is obtained for the webpage resource, the user's historical access record is obtained from the cookie information, and the element information in the webpage resource whose number of visits is greater than a first threshold is taken from the historical access record as the summary information. The summary information shown in the search results is thus personalized for each user. This improves the user experience and makes the summary more valuable: the user can obtain the desired information from the summary itself, which reduces how often the user must click through to the pages behind search results, improves retrieval speed, reduces the number of interactions with the search engine, and increases the data processing rate.
The above description is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features and advantages of the invention are more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 schematically shows a flowchart of the steps of Embodiment 1 of a search engine-based summary information extraction method according to an embodiment of the present invention;
Fig. 2 schematically shows a flowchart of the steps of Embodiment 2 of a search engine-based summary information extraction method according to an embodiment of the present invention;
Fig. 2-a schematically shows a download body page of Embodiment 2 of a search engine-based summary information extraction method according to an embodiment of the present invention;
Fig. 2-b schematically shows a first output result of Embodiment 2 of a search engine-based summary information extraction method according to an embodiment of the present invention;
Fig. 3 schematically shows a flowchart of the steps of Embodiment 3 of a search engine-based summary information extraction method according to an embodiment of the present invention;
Fig. 3-a schematically shows a video website home page of Embodiment 3 of a search engine-based summary information extraction method according to an embodiment of the present invention;
Fig. 3-b schematically shows a second output result of Embodiment 3 of a search engine-based summary information extraction method according to an embodiment of the present invention;
Fig. 4 schematically shows a flowchart of the steps of Embodiment 4 of a search engine-based summary information extraction method according to an embodiment of the present invention;
Fig. 4-a schematically shows a video website home page of Embodiment 4 of a search engine-based summary information extraction method according to an embodiment of the present invention;
Fig. 4-b schematically shows a third output result of Embodiment 4 of a search engine-based summary information extraction method according to an embodiment of the present invention;
Fig. 5 schematically shows a flowchart of the steps of Embodiment 5 of a search engine-based summary information extraction method according to an embodiment of the present invention;
Fig. 6 schematically shows a structural block diagram of an embodiment of a search engine-based summary information extraction apparatus according to an embodiment of the present invention;
Fig. 7 schematically shows a structural block diagram of an embodiment of a search engine according to an embodiment of the present invention;
Fig. 8 schematically shows a block diagram of a computing device for performing the method according to the invention; and
Fig. 9 schematically shows a storage unit for holding or carrying program code that implements the method according to the invention.
Detailed description of embodiments
The invention is further described below with reference to the drawings and specific embodiments.
Referring to Fig. 1, a flowchart of the steps of Embodiment 1 of a search engine-based summary information extraction method according to an embodiment of the present invention is shown. This embodiment may include the following steps:
Step 101: obtaining a matching webpage resource based on a search string received in a search engine;
Step 102: identifying a page type of the webpage resource;
Step 103: extracting, for the page type, corresponding summary information from the webpage resource;
Step 104: outputting the summary information.
In this embodiment, after receiving the search string entered by the user, the search engine finds all webpage resources that contain the search string as the matching webpage resources. The summary information output in the search results is extracted from each webpage resource according to its identified page type. The summary information shown in the search results therefore expresses the central meaning of the whole page more accurately and is more valuable to the user, who can obtain the desired information from the summary itself. This reduces how often the user must click through to the pages behind search results to find the required information, which improves retrieval speed, reduces the number of interactions with the search engine, and increases the data processing rate.
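The flow of steps 101 to 104 can be sketched as a small Python pipeline. This is only an illustration under stated assumptions: `Page`, `extract_summaries`, and the callables `retrieve`, `classify`, and `extractors` are hypothetical names, and the source does not prescribe how the stages are wired together.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Page:
    """A matched webpage resource (hypothetical minimal shape)."""
    url: str
    html: str

def extract_summaries(
    query: str,
    retrieve: Callable[[str], List[Page]],
    classify: Callable[[Page], str],
    extractors: Dict[str, Callable[[Page], str]],
) -> List[str]:
    """Run steps 101-103 and return the summaries for the caller to output."""
    summaries = []
    for page in retrieve(query):               # step 101: matching resources
        page_type = classify(page)             # step 102: identify page type
        extractor = extractors.get(page_type)  # step 103: type-specific extraction
        if extractor is not None:
            summaries.append(extractor(page))
    return summaries                           # step 104: output is left to the caller
```

Keeping one extractor per page type mirrors the idea that different page types call for different summaries; pages of an unrecognized type are simply skipped in this sketch.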
Referring to Fig. 2, a flowchart of the steps of Embodiment 2 of a search engine-based summary information extraction method according to an embodiment of the present invention is shown. This embodiment may include the following steps:
Step 201: obtaining a matching webpage resource based on a search string received in a search engine, the webpage resource including webpage source code;
The search string (query) is the search information that the user enters in the search engine interface to express the user's intent and to request a search for related webpage resources.
After receiving the search string entered by the user, the search engine processes it (for example, word segmentation, stop-word removal, and typo detection) and then looks up, in a pre-built index database, all webpage resources that contain the search string as the matching webpage resources. A webpage resource may include information such as the webpage body text, the URL of the webpage, the webpage source code that makes up the webpage, and the links into and out of the webpage.
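A minimal sketch of the query processing and index lookup described above, in Python. It is a simplification under stated assumptions: tokens are split on whitespace (real Chinese word segmentation needs a dedicated segmenter), typo detection is omitted, and `STOP_WORDS`, `preprocess`, `build_index`, and `lookup` are hypothetical names.

```python
from typing import Dict, List, Set

STOP_WORDS = {"the", "a", "of", "for"}  # hypothetical stop-word list

def preprocess(query: str) -> List[str]:
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in query.lower().split() if t not in STOP_WORDS]

def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Build an inverted index: term -> set of URLs whose text contains it."""
    index: Dict[str, Set[str]] = {}
    for url, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(url)
    return index

def lookup(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the URLs that contain every remaining query term."""
    terms = preprocess(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```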
Step 202: identifying a page type of the webpage resource, the page type including a single page;
After the webpage resource is obtained, the corresponding page type can be identified from it. In a preferred embodiment of the present invention, step 202 may include the following sub-steps:
Sub-step S11: extracting the page frame of the webpage resource and calculating a page frame ID;
In a specific implementation, the page frame of a webpage resource may be extracted from the HTML tags in the webpage source code: only frame-type tags such as frame and table are retained, together with their id, name, and class attributes, and all other attributes are removed. Alternatively, the body text of the webpage may be identified by its punctuation and removed, leaving the page frame.
After the page frame is extracted, a hash value of the page frame is computed from the retained tags and attributes with a hash algorithm; this hash value is the page frame ID. For example, frame-type tags such as frame and table, together with their id, name, and class attributes, are hashed, and the result is the page frame ID. Because the same hash function is used throughout, identical page frames yield identical page frame IDs.
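Sub-step S11 can be sketched with Python's standard `html.parser` and `hashlib` modules. The set of frame-type tags in `FRAME_TAGS` (in particular the inclusion of `div` and `iframe`) and the choice of SHA-1 are assumptions; the source names only frame and table as examples and does not say which hash algorithm is used.

```python
import hashlib
from html.parser import HTMLParser

FRAME_TAGS = {"frame", "frameset", "iframe", "table", "div"}  # assumed tag set
KEPT_ATTRS = ("id", "name", "class")                          # attributes the source retains

class FrameExtractor(HTMLParser):
    """Collects frame-type tags, keeping only their id, name, and class attributes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in FRAME_TAGS:
            kept = dict((k, v or "") for k, v in attrs if k in KEPT_ATTRS)
            attr_text = "".join(f' {k}="{kept[k]}"' for k in KEPT_ATTRS if k in kept)
            self.parts.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in FRAME_TAGS:
            self.parts.append(f"</{tag}>")

def page_frame(html: str) -> str:
    """Return the page frame: the markup with text and non-frame tags stripped."""
    parser = FrameExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

def page_frame_id(html: str) -> str:
    """Hash the page frame; SHA-1 is an assumed choice of hash function."""
    return hashlib.sha1(page_frame(html).encode("utf-8")).hexdigest()
```

Two pages built from the same template but with different text and links produce the same ID under this sketch, which is the property the paragraph above relies on.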
Sub-step S12: if the number of page frames with the same page frame ID is greater than a preset threshold, calculating a page frame pattern;
In practice, the page frame pattern is calculated separately for the title, the time, the webpage body text, and so on. The calculation may use an automatic machine learning mechanism, for example a support vector machine (SVM). During learning, the extracted page frames are fed to the SVM, which matches the key HTML tags of the page frames; the key HTML tags of page frames with the same ID match exactly. Therefore, once the number of page frames learned for a given ID reaches the preset threshold, the SVM outputs the page frame pattern of that page frame.
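The threshold condition of sub-step S12 can be sketched as a simple grouping step in Python. The sketch covers only the counting; the SVM learning itself is not implemented here, and the function name `frames_ready_for_learning` and the `(url, frame_id)` input shape are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def frames_ready_for_learning(
    pages: Iterable[Tuple[str, str]],  # (url, page frame ID) pairs
    threshold: int,
) -> Dict[str, List[str]]:
    """Group pages by page frame ID and keep the IDs seen more than threshold times."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for url, frame_id in pages:
        groups[frame_id].append(url)
    return {fid: urls for fid, urls in groups.items() if len(urls) > threshold}
```

Only the groups returned here would be handed on to pattern learning; frame IDs seen too rarely are left out.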
Sub-step S13: matching the page frame pattern against the page frame patterns in a pre-generated database to identify the page type.
The pre-generated database stores page frame patterns of known types together with the weight of each webpage feature under each pattern. For each matched feature, the corresponding weight is added to the page frame under the relevant category; the category with the highest total weight is taken as the page type.
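The weighted matching of sub-step S13 can be sketched as follows. The contents of `KNOWN_PATTERNS` (type names, feature strings, and weights) are invented for illustration; the source states only that the database holds known patterns with per-feature weights and that the highest total wins.

```python
from typing import Dict, Optional, Set

# Hypothetical database: page type -> {feature: weight}.
KNOWN_PATTERNS: Dict[str, Dict[str, float]] = {
    "download_page": {"class=downBtn": 3.0, "class=toolInfo": 1.0},
    "video_list_page": {"class=item": 1.0, "class=hot": 2.0},
}

def identify_page_type(
    features: Set[str],
    patterns: Dict[str, Dict[str, float]] = KNOWN_PATTERNS,
) -> Optional[str]:
    """Sum the weights of matched features per type and return the best type.

    Returns None when no feature matches any known pattern.
    """
    scores = {
        page_type: sum(w for f, w in weights.items() if f in features)
        for page_type, weights in patterns.items()
    }
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] == 0:
        return None
    return best
```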
The page types in this embodiment may include a single page and/or a list page. A single page is a page whose page elements are relatively uniform, and may include one or a combination of the following: a download body page, an audio/video playback page, a novel reading page, a question-and-answer page, a news photo gallery page, and a special-topic page. A list page may include an audio/video list page.
Step 203: for the single page, extracting one or more key pieces of element information from the webpage source code as the summary information;
The summary information may include at least one or a combination of the following for the one or more pieces of element information: element URL, element identifier, element picture, and element text description.
In a specific implementation, if the page type of the webpage resource matching the search string is a single page, one or more key pieces of element information can be extracted from the content of the HTML tags in the webpage source code. These may include the <a> tag (defines a hyperlink; its href attribute indicates the link target), the <meta> tag (provides meta-information about the page, such as a description and keywords for search engines and the update frequency), the <span> tag (groups inline elements), the <div> tag, the <p> tag, the <script> tag, the class attribute, and so on. For example, for a download body page, the corresponding element information can be obtained as summary information from the following code:
<div class="toolBottom">
<div class="txtLogo"></div>
<p class="toolInfo">56.6M|Update date 2014/01/03</p>
<p class="roundIcon"><a href="intro.shtml" target="_blank" class="link" title="Functional Animation Display">Functional Animation Display</a></p>
<a href="http://dldir1.XX.com/XXfile/XX/XX2013/XX2013SP6/9305/XX2013SP6.exe" class="downBtn" title="Download now" onclick="tcssClick&&tcssClick('downXX')">Download now</a>
</div>
Here XX is the download object, and the corresponding element information (summary information) is: 56.6M|Update date 2014/01/03; the download address is: http://dldir1.XX.com/XXfile/XX/XX2013/XX2013SP6/9305/XX2013SP6.exe.
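The extraction in this example can be sketched with Python's standard `html.parser`. The class names `downBtn` and `toolInfo` come from the snippet above; treating them as the markers of the download link and the size/date line is an assumption that holds for this page only, and `DownloadSummaryParser` and `extract_download_summary` are hypothetical names.

```python
from html.parser import HTMLParser

class DownloadSummaryParser(HTMLParser):
    """Pulls the download link and the size/date line out of a download body page."""

    def __init__(self):
        super().__init__()
        self.download_url = None
        self.info = None
        self._in_info = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "downBtn" in classes and self.download_url is None:
            self.download_url = attrs.get("href")   # first download button wins
        elif tag == "p" and "toolInfo" in classes:
            self._in_info = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_info = False

    def handle_data(self, data):
        if self._in_info and data.strip():
            self.info = (self.info or "") + data.strip()

def extract_download_summary(html: str) -> dict:
    """Return the summary fields found in the page; missing fields are None."""
    parser = DownloadSummaryParser()
    parser.feed(html)
    parser.close()
    return {"info": parser.info, "download_url": parser.download_url}
```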
Step 204: outputting the summary information.
Once the summary information for the webpage resource has been obtained, it can be output at a preset position within the corresponding search result when the search results are output.
For example, in the download body page shown in Fig. 2-a, the download body page 200 contains a download object identifier 210, a download object description 220, download address 1 (230), download address 2 (240), and similar information. The download object identifier may be, for example, the official release of XX software; the download object description may include the software size, update time, software language, provider, software license, software rating, application platform, a brief introduction to the software's functions, and so on. On the download body page 200, the user's main need is the download address, so step 203 can extract the download address link from the page and present it in the summary information of the search result. The user can then obtain the download address directly from the summary information and download the object without entering the page behind the search result to look for the address. The output summary information is shown in the first output result diagram of Fig. 2-b.
In this embodiment, after receiving the search string entered by the user, the search engine finds all webpage resources that contain the search string as the matching webpage resources and identifies their page type; for webpage resources that are single pages, it extracts the corresponding summary information from the source code. The summary information shown in the search results therefore expresses the central meaning of the whole page more accurately and is more valuable to the user, who can obtain the desired information from the summary itself. This reduces how often the user must click through to the pages behind search results to find the required information, which improves retrieval speed, reduces the number of interactions with the search engine, and increases the data processing rate.
Referring to Fig. 3, a flowchart of the steps of Embodiment 3 of a search engine-based summary information extraction method according to an embodiment of the present invention is shown. This embodiment may include the following steps:
Step 301: obtaining a matching webpage resource based on a search string received in a search engine, the webpage resource including webpage source code;
Step 302: identifying a page type of the webpage resource, the page type including a list page;
In a preferred embodiment of the present invention, step 302 may include the following sub-steps:
Sub-step S21: extracting the page frame of the webpage resource and calculating a page frame ID;
Sub-step S22: if the number of page frames with the same page frame ID is greater than a preset threshold, calculating a page frame pattern;
Sub-step S23: matching the page frame pattern against the page frame patterns in a pre-generated database to identify the page type.
The page types in this embodiment may include a single page and/or a list page. A list page is a page with a relatively large number of page elements, and may include list pages such as an audio/video home page.
Step 303: for the list page, extracting from the webpage source code, as the summary information, the one or more pieces of element information that rank highest by the click-through rate counted by the webpage resource;
The summary information may include at least one or a combination of the following for the one or more pieces of element information: element URL, element identifier, element picture, and element text description.
In a specific implementation, if the page type of the webpage resource matching the search string is a list page, the click-through data counted by the webpage (such as a video leaderboard) can be obtained from the content of the HTML tags in the webpage source code, and one or more top-ranked pieces of element information can then be extracted from that data as the summary information. The HTML tags may include the <a> tag (defines a hyperlink; its href attribute indicates the link target), the <meta> tag (provides meta-information about the page, such as a description and keywords for search engines and the update frequency), the <span> tag (groups inline elements), the <div> tag, the <p> tag, the <script> tag, the class attribute, and so on. For example, for a video website home page, the corresponding element information can be obtained as summary information from the following code:
<div class="item">
<label class="hot">1</label>
<a class="name" target="_blank" href="http://v.youku.com/v_show/id_XNzIxNzc0NTUy.html" data-from="1-1">Sharp XX DVD Edition</a>
</div>
The summary information then shows that the top-ranked element is Sharp XX DVD Edition. In practice, each piece of element information may include one or more of the following attributes: element URL, element identifier, element picture, and element text description. For the example above, the summary information can therefore give the playback URL, name, picture, and other information for Sharp XX DVD Edition.
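Extracting the top-ranked entries from such a leaderboard can be sketched with Python's standard `html.parser`. The class names `item`, `hot`, and `name` come from the snippet above; the assumptions are that the `hot` label holds the rank, that items contain no nested `div`, and that `RankedItemParser` and `top_ranked` are hypothetical names.

```python
from html.parser import HTMLParser

class RankedItemParser(HTMLParser):
    """Collects (rank, url, name) for each <div class="item"> leaderboard entry."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None   # entry being built, or None outside an item
        self._field = None     # "rank" or "name" while inside label / a

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "item" in classes:
            self._current = {"rank": None, "url": None, "name": ""}
        elif self._current is not None and tag == "label" and "hot" in classes:
            self._field = "rank"
        elif self._current is not None and tag == "a" and "name" in classes:
            self._current["url"] = attrs.get("href")
            self._field = "name"

    def handle_endtag(self, tag):
        if tag in ("label", "a"):
            self._field = None
        elif tag == "div" and self._current is not None:
            if self._current["rank"] is not None:
                self.items.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is None or self._field is None:
            return
        text = data.strip()
        if self._field == "rank" and text.isdigit():
            self._current["rank"] = int(text)
        elif self._field == "name":
            self._current["name"] += text

def top_ranked(html: str, n: int = 2) -> list:
    """Return the n best-ranked entries (lowest rank number first)."""
    parser = RankedItemParser()
    parser.feed(html)
    parser.close()
    return sorted(parser.items, key=lambda item: item["rank"])[:n]
```

The value of `n` is left to the caller, matching the source's remark that the number of entries shown can be set as needed.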
Step 304: outputting the summary information.
Note that when the summary information is output, the one or more pieces of element information may be displayed in the search results as a carousel.
For example, as in the video website home page diagram shown in Fig. 3-a, the video website home page 300 may include a video category list 310, the videos of each video category, the corresponding leaderboards (such as the category 1 leaderboard 320), and similar information. The video category list may include TV series, movies, variety shows, music, animation, travel, and so on. If category 1 (330) is TV series, then video A through video F are TV series programs, and the category 1 leaderboard may list, in order, video A, video B, video D, video F, and so on. Step 303 can then present in the summary the top n videos on the leaderboard of each category of the video website home page 300 (for example the top 2; the number can be set as needed and is not limited by this embodiment), as shown in the second output result diagram of Fig. 3-b. The videos A, B, and so on shown in the summary information may include the name, playback URL, picture, and/or text description of the corresponding video.
In this embodiment, after receiving the search string entered by the user, the search engine finds all webpage resources that contain the search string as the matching webpage resources and identifies their page type; for webpage resources that are list pages, it extracts the corresponding summary information from the source code. The summary information shown in the search results therefore expresses the central meaning of the whole page more accurately and is more valuable to the user, who can obtain the desired information from the summary itself. This reduces how often the user must click through to the pages behind search results to find the required information, which improves retrieval speed, reduces the number of interactions with the search engine, and increases the data processing rate.
Referring to FIG. 4, a flowchart of the steps of Embodiment 4 of a search engine-based summary information extraction method according to one embodiment of the present invention is shown. This embodiment may include the following steps:
Step 401: obtain matching webpage resources based on a search string received in the search engine;
Step 402: identify the page type of the webpage resource;
In a preferred embodiment of the present invention, step 402 may include the following sub-steps:
Sub-step S31: extract the page frame of the webpage resource and compute a page frame ID;
Sub-step S32: if the number of page frames sharing the same page frame ID is greater than a preset threshold, compute a page frame pattern;
Sub-step S33: match the page frame pattern against the page frame patterns in a pre-generated database to identify the page type.
Step 403: for the page type, extract the corresponding summary information from the webpage resource;
Based on the user's historical access records for the matching webpage resource, this embodiment may present element information related to those records in the summary information. Specifically:
In a preferred embodiment of the present invention, step 403 may include the following sub-steps:
Sub-step S41: for the page type, send a first query request to the website object corresponding to the webpage resource;
Sub-step S42: receive the historical access record that the website object sends in response to the first query request, the historical access record being a record that the website object derives from cookie information after obtaining that cookie information from the current terminal;
Sub-step S43: from the historical access record, obtain the element information in the webpage resource whose access count is greater than a first threshold, as the summary information.
Specifically, if a webpage resource matching the search string (query) belongs to a website object, the search engine may send that website object a first query request, which notifies the website object that a user is querying. On receiving the first query request, the website object obtains the corresponding cookie information from the current terminal, derives the current user's historical access record from that cookie information, and returns it to the search engine. From the received historical access record, the search engine takes the element information in the webpage resource whose access count is greater than the first threshold as the summary information, giving the user a personalized summary. The first threshold may be 1 or another integer value; this embodiment places no limitation on it.
In another preferred embodiment of the present invention, step 403 may include the following sub-steps:
Sub-step S51: for the page type, send a second query request to the browser of the current terminal, the second query request including the website object identifier of the webpage resource;
Sub-step S52: receive the historical access record, returned by the browser, that relates to the website object identifier on the current terminal, the historical access record being obtained after the browser of the current terminal retrieves the cookie information related to the website object;
Sub-step S53: from the historical access record, obtain the element information in the webpage resource whose access count is greater than the first threshold, as the summary information.
Specifically, if a webpage resource matching the search string (query) belongs to a website object, the search engine may send a second query request to the browser of the current terminal, asking the browser to retrieve the cookie information from the user's visits to that website object. On receiving the second query request, the browser obtains from the current terminal the cookie information corresponding to the website object identifier, derives the current user's historical access record from that cookie information, and returns it to the search engine. From the received historical access record, the search engine takes the element information in the webpage resource whose access count is greater than the first threshold as the summary information, giving the user a personalized summary.
Step 404: add a specific tag (TAG) to the summary information;
In this embodiment, after the personalized summary information has been extracted from the user's historical access records, a specific tag (TAG) may also be added to it, for example a recommendation mark.
Step 405: output the summary information with the specific tag (TAG) added.
In a specific implementation, the summary information includes at least one of the following, or a combination of several: the element URL, element identifier, element picture, and element text description of one or more items of element information.
For example, as shown in the schematic of a video website home page in FIG. 4-a, the home page 400 may include a video category list 410, the videos of each video category, and the corresponding ranking lists (such as the category 1 ranking list 420). The video category list may include TV series, movies, variety shows, music, animation, travel, and so on. If category 1 430 is TV series, then video A to video F are individual TV series, and the category 1 ranking list may be, in order, video A, video B, video D, video F, and so on. Step 403 yields the user's historical access record for the video website 400. If, for instance, the videos the user has viewed on the site are video E and video F, the viewed videos are marked with a tag such as "优" ("excellent") and presented in the summary, as shown in the third output result schematic in FIG. 4-b. The tag content can be set as needed; this embodiment places no limitation on it. Each video shown in the summary information, such as video A or video B, may include the video's name, playback URL, picture, and/or text description.
In this embodiment of the invention, after receiving a search string entered by the user, the search engine finds all webpage resources that contain the search string and treats them as the matching webpage resources. It identifies the page type of each webpage resource and then, for the different page types, obtains the corresponding cookie information for the webpage resource, derives the user's historical access record from the cookie information, and takes from that record the element information in the webpage resource whose access count is greater than the first threshold as the summary information. The summary information displayed in the search results is therefore personalized for each user and more valuable to that user. The user can get the desired information from the summary itself and less often needs to click through to the pages behind the search results to find it, which speeds up retrieval, reduces the number of interactions with the search engine, and raises the data processing rate.
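The selection and tagging in sub-step S43 (or S53) and step 404 can be sketched as follows. This is only an illustrative sketch under stated assumptions: the `Element` class, its field names, the `personalized_summary` function, the flat list-of-IDs shape of the history record, and the `"recommended"` tag value are all hypothetical and do not come from the specification, which does not prescribe any data format.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Element:
    """One element of a page, e.g. a video. All field names are illustrative."""
    element_id: str
    url: str
    title: str
    tags: list = field(default_factory=list)


def personalized_summary(elements, history, first_threshold=1, tag="recommended"):
    """Return tagged copies of the elements accessed more than first_threshold times.

    history is assumed to be a list of element IDs, one entry per recorded visit.
    """
    counts = Counter(history)
    summary = []
    for element in elements:
        # Sub-step S43/S53: keep elements whose access count exceeds the threshold.
        if counts[element.element_id] > first_threshold:
            # Step 404: attach the specific tag to a copy, leaving the input untouched.
            summary.append(
                Element(element.element_id, element.url, element.title,
                        element.tags + [tag])
            )
    return summary


if __name__ == "__main__":
    videos = [Element(v, f"https://video.example/{v}", f"Video {v}") for v in "ABCDEF"]
    visits = ["E", "E", "F", "F", "F", "A"]
    for item in personalized_summary(videos, visits):
        print(item.element_id, item.tags)
```

The comparison is strictly "greater than", matching the wording of the sub-steps, so with a threshold of 1 an element visited only once is not selected.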
Referring to FIG. 5, a flowchart of the steps of Embodiment 5 of a search engine-based summary information extraction method according to one embodiment of the present invention is shown. This embodiment may include the following steps:
Step 501: obtain matching webpage resources based on a search string received in the search engine;
Step 502: identify the page type of the webpage resource;
In a preferred embodiment of the present invention, step 502 may include the following sub-steps:
Sub-step S61: extract the page frame of the webpage resource and compute a page frame ID;
Sub-step S62: if the number of page frames sharing the same page frame ID is greater than a preset threshold, compute a page frame pattern;
Sub-step S63: match the page frame pattern against the page frame patterns in a pre-generated database to identify the page type.
Step 503: for the page type, look up the summary information corresponding to the webpage resource in a pre-generated summary database, the summary database storing webpage resources and their corresponding summary information;
Specifically, besides obtaining the summary information of each hit webpage resource in real time as described in Embodiments 1 to 4 above, this embodiment may also extract the summary information of each webpage resource in advance, when the spider crawls the page, store it in the summary database, and update the summary information in the summary database at preset intervals. When a webpage resource is hit, the summary information corresponding to it is obtained from the summary database.
Step 504: output the summary information.
The summary information includes at least one of the following, or a combination of several: the element URL, element identifier, element picture, and element text description of one or more items of element information.
In this embodiment of the invention, after receiving a search string entered by the user, the search engine finds all webpage resources that contain the search string and treats them as the matching webpage resources, then looks up the corresponding summary information in the pre-generated summary database and outputs it in the search results. This speeds up the search. The summary information displayed in the search results also conveys the central meaning of the whole page document more accurately and is more valuable to the user. The user can get the desired information from the summary itself and less often needs to click through to the pages behind the search results to find it, which reduces the number of interactions with the search engine and raises the data processing rate.
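The pre-generated summary database of step 503 can be pictured as a keyed store that is filled at crawl time, read at query time, and refreshed at a preset interval. The sketch below is a minimal in-memory illustration under stated assumptions: the `SummaryStore` class, its method names, the URL key, and the injected `extract` and `clock` callables are all hypothetical. The specification does not say how the database is implemented or how staleness is tracked.

```python
import time


class SummaryStore:
    """In-memory stand-in for the summary database (hypothetical design)."""

    def __init__(self, extract, max_age_seconds=3600.0, clock=time.time):
        self._extract = extract          # function: page source -> summary
        self._max_age = max_age_seconds  # the "preset interval" between updates
        self._clock = clock              # injectable so the sketch is testable
        self._entries = {}               # url -> (summary, time stored)

    def put(self, url, page_source):
        """Crawl time: extract the summary in advance and store it."""
        self._entries[url] = (self._extract(page_source), self._clock())

    def get(self, url):
        """Query time: return the stored summary, or None if the URL is unknown."""
        entry = self._entries.get(url)
        return entry[0] if entry else None

    def stale_urls(self):
        """Return the URLs whose summaries are due for an update."""
        now = self._clock()
        return [url for url, (_, stored_at) in self._entries.items()
                if now - stored_at >= self._max_age]
```

A crawler would call `put` for each fetched page and periodically re-crawl whatever `stale_urls` returns, while the query path only calls `get`, which is what removes extraction work from the time of the search.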
For simplicity of description, the method embodiments are presented as a series of combined actions. Those skilled in the art should understand, however, that the invention is not limited by the described order of actions, because according to the invention some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the invention.
Referring to FIG. 6, a structural block diagram of an embodiment of a search engine-based summary information extraction apparatus according to one embodiment of the present invention is shown. The apparatus may include the following modules:
a webpage resource obtaining module 601, adapted to obtain matching webpage resources based on a search string received in the search engine;
a page type identification module 602, adapted to identify the page type of the webpage resource;
a summary information extraction module 603, adapted to extract, for the page type, the corresponding summary information from the webpage resource;
an information output module 604, adapted to output the summary information.
In a preferred embodiment of the present invention, the page type identification module 602 is further adapted to:
extract the page frame of the webpage resource and compute a page frame ID;
if the number of page frames sharing the same page frame ID is greater than a preset threshold, compute a page frame pattern;
match the page frame pattern against the page frame patterns in a pre-generated database to identify the page type.
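The three operations of the page type identification module can be illustrated with a rough sketch. The text here does not define how a page frame, a page frame ID, or a page frame pattern is computed, so everything below is an assumption made for illustration: the frame is taken to be the sequence of opening tag names, the frame ID a SHA-256 hash of that sequence, and the pattern simply the frame string itself. The function names and the `known_patterns` mapping are hypothetical.

```python
import hashlib
from collections import Counter
from html.parser import HTMLParser


class _FrameParser(HTMLParser):
    """Collects opening tag names and ignores text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def page_frame(html):
    """Assumed definition: the page frame is the page's sequence of tag names."""
    parser = _FrameParser()
    parser.feed(html)
    parser.close()
    return "/".join(parser.tags)


def frame_id(frame):
    """Assumed definition: the page frame ID is a hash of the frame string."""
    return hashlib.sha256(frame.encode("utf-8")).hexdigest()


def identify_page_types(pages, known_patterns, threshold):
    """Map each URL to a page type, or None when it cannot be identified.

    pages: dict of url -> HTML source.
    known_patterns: dict of frame pattern -> page type (the pre-generated database).
    """
    frames = {url: page_frame(html) for url, html in pages.items()}
    ids = {url: frame_id(frame) for url, frame in frames.items()}
    counts = Counter(ids.values())
    types = {}
    for url, frame in frames.items():
        if counts[ids[url]] > threshold:
            # Enough pages share this frame ID: treat the frame as the pattern
            # and look it up among the known patterns.
            types[url] = known_patterns.get(frame)
        else:
            types[url] = None
    return types
```

Pages generated from the same template differ in text but share a tag skeleton, which is why grouping by a hash of the skeleton is one plausible way to find frames that recur more often than the preset threshold.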
In a preferred embodiment of the present invention, the webpage resource includes webpage source code, the page type includes a single page, and the summary information extraction module 603 is further adapted to:
for the single page, extract one or more key items of element information from the webpage source code as the summary information.
As a preferred example of this embodiment, the single page may include one of the following, or a combination of several: a download body page, an audio/video playback page, a novel reading page, a question-and-answer page, a news photo gallery page, and a special topic page.
In a preferred embodiment of the present invention, the webpage resource includes webpage source code, the page type includes a list page, and the summary information extraction module 603 is further adapted to:
for the list page, extract from the webpage source code the one or more items of element information that rank highest by the click-through rate counted by the webpage resource, as the summary information.
As a preferred example of this embodiment, the list page may include an audio/video list page.
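Selecting the top-ranked elements of a list page by click-through rate amounts to a top-k selection. The sketch below assumes the click counts have already been parsed out of the page source into `(element_id, click_count)` pairs. That input shape and the function name `top_elements_by_clicks` are hypothetical, since the text does not say how the statistics appear in the source code.

```python
import heapq


def top_elements_by_clicks(elements, k):
    """Return the k (element_id, click_count) pairs with the highest click counts.

    elements: iterable of (element_id, click_count) pairs, assumed to be
    already parsed from the list page's source code.
    """
    return heapq.nlargest(k, elements, key=lambda pair: pair[1])


if __name__ == "__main__":
    ranking = [("video A", 120), ("video B", 300), ("video C", 45), ("video D", 210)]
    print(top_elements_by_clicks(ranking, 2))
```

`heapq.nlargest` returns the results in descending order, so the output can be shown directly as a short ranking in the summary.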
In a preferred embodiment of the present invention, the summary information extraction module 603 is further adapted to:
for the page type, send a first query request to the website object corresponding to the webpage resource;
receive the historical access record that the website object sends in response to the first query request, the historical access record being a record that the website object derives from cookie information after obtaining that cookie information from the current terminal;
from the historical access record, obtain the element information in the webpage resource whose access count is greater than a first threshold, as the summary information.
In a preferred embodiment of the present invention, the summary information extraction module 603 is further adapted to:
for the page type, send a second query request to the browser of the current terminal, the second query request including the website object identifier of the webpage resource;
receive the historical access record, returned by the browser, that relates to the website object identifier on the current terminal, the historical access record being obtained after the browser of the current terminal retrieves the cookie information related to the website object;
from the historical access record, obtain the element information in the webpage resource whose access count is greater than the first threshold, as the summary information.
In a preferred embodiment of the present invention, this embodiment may further include:
a tag adding module, adapted to add a specific tag (TAG) to the summary information.
In a preferred embodiment of the present invention, the summary information extraction module 603 is further adapted to:
for the page type, look up the summary information corresponding to the webpage resource in a pre-generated summary database, the summary database storing webpage resources and their corresponding summary information.
As a preferred example of this embodiment, the summary information may include at least one of the following, or a combination of several: the element URL, element identifier, element picture, and element text description of one or more items of element information.
Referring to FIG. 7, a structural block diagram of an embodiment of a search engine according to one embodiment of the present invention is shown. The search engine may include the following modules:
a webpage resource obtaining module 701, adapted to obtain matching webpage resources based on a received search string;
a page type identification module 702, adapted to identify the page type of the webpage resource;
a summary information extraction module 703, adapted to extract, for the page type, the corresponding summary information from the webpage resource;
an information output module 704, adapted to output the summary information.
In a preferred embodiment of the present invention, the page type identification module 702 is further adapted to:
extract the page frame of the webpage resource and compute a page frame ID;
if the number of page frames sharing the same page frame ID is greater than a preset threshold, compute a page frame pattern;
match the page frame pattern against the page frame patterns in a pre-generated database to identify the page type.
In a preferred embodiment of the present invention, the webpage resource includes webpage source code, the page type includes a single page, and the summary information extraction module 703 is further adapted to:
for the single page, extract one or more key items of element information from the webpage source code as the summary information.
As a preferred example of this embodiment, the single page may include one of the following, or a combination of several: a download body page, an audio/video playback page, a novel reading page, a question-and-answer page, a news photo gallery page, and a special topic page.
In a preferred embodiment of the present invention, the webpage resource includes webpage source code, the page type includes a list page, and the summary information extraction module 703 is further adapted to:
for the list page, extract from the webpage source code the one or more items of element information that rank highest by the click-through rate counted by the webpage resource, as the summary information.
As a preferred example of this embodiment, the list page may include an audio/video list page.
In a preferred embodiment of the present invention, the summary information extraction module 703 is further adapted to:
for the page type, send a first query request to the website object corresponding to the webpage resource;
receive the historical access record that the website object sends in response to the first query request, the historical access record being a record that the website object derives from cookie information after obtaining that cookie information from the current terminal;
from the historical access record, obtain the element information in the webpage resource whose access count is greater than a first threshold, as the summary information.
In a preferred embodiment of the present invention, the summary information extraction module 703 is further adapted to:
for the page type, send a second query request to the browser of the current terminal, the second query request including the website object identifier of the webpage resource;
receive the historical access record, returned by the browser, that relates to the website object identifier on the current terminal, the historical access record being obtained after the browser of the current terminal retrieves the cookie information related to the website object;
from the historical access record, obtain the element information in the webpage resource whose access count is greater than the first threshold, as the summary information.
In a preferred embodiment of the present invention, this embodiment may further include:
a tag adding module, adapted to add a specific tag (TAG) to the summary information.
In a preferred embodiment of the present invention, the summary information extraction module 703 is further adapted to:
for the page type, look up the summary information corresponding to the webpage resource in a pre-generated summary database, the summary database storing webpage resources and their corresponding summary information.
As a preferred example of this embodiment, the summary information may include at least one of the following, or a combination of several: the element URL, element identifier, element picture, and element text description of one or more items of element information.
The embodiments in this specification are described progressively: each embodiment focuses on how it differs from the others, and the parts that are the same or similar across embodiments can be understood by cross-reference. Because the apparatus and search engine embodiments are essentially similar to the method embodiments, they are described more briefly; for the related details, refer to the corresponding parts of the method embodiments.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of a device for search engine-based summary information extraction according to embodiments of the present invention. The invention may also be implemented as a device or apparatus program (for example, a computer program or a computer program product) for performing part or all of the methods described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
For example, FIG. 8 shows a computing device, such as a retrieval server, that can implement search engine-based summary information extraction according to the present invention. The computing device conventionally includes a processor 810 and a computer program product or computer-readable medium in the form of a memory 820. The memory 820 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk, or ROM. The memory 820 has a storage space 830 for program code 831 for performing any of the method steps described above. For example, the storage space 830 for program code may include individual program codes 831, each implementing one of the steps of the methods above. The program code can be read from, or written to, one or more computer program products. These computer program products include program code carriers such as hard disks, compact discs (CDs), memory cards, or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. 9. The storage unit may have storage segments, storage space, and the like arranged similarly to the memory 820 in the computing device of FIG. 8. The program code may, for example, be compressed in an appropriate form. Typically, the storage unit includes computer-readable code 831', that is, code that can be read by a processor such as 810 and that, when run by a computing device, causes the computing device to perform the steps of the methods described above.
Reference herein to "one embodiment", "an embodiment", or "one or more embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Note also that instances of the phrase "in one embodiment" do not necessarily all refer to the same embodiment.
Numerous specific details are set forth in the description provided here. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.
It should also be noted that the language used in this specification has been chosen mainly for readability and instructional purposes, not to delineate or restrict the subject matter of the invention. Many modifications and variations will therefore be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. As to the scope of the invention, the disclosure made here is illustrative rather than restrictive, and the scope of the invention is defined by the appended claims.

Claims (25)

  1. A search engine-based summary information extraction method, comprising the steps of:
    obtaining matching webpage resources based on a search string received in the search engine;
    identifying the page type of the webpage resource;
    for the page type, extracting the corresponding summary information from the webpage resource;
    outputting the summary information.
  2. The method of claim 1, wherein the step of identifying the page type of the webpage resource comprises:
    extracting the page frame of the webpage resource and computing a page frame ID;
    if the number of page frames sharing the same page frame ID is greater than a preset threshold, computing a page frame pattern;
    matching the page frame pattern against the page frame patterns in a pre-generated database to identify the page type.
  3. The method according to claim 1 or 2, wherein the webpage resource comprises webpage source code, the page type comprises a single page, and the step of extracting, for the page type, corresponding summary information from the webpage resource comprises:
    for the single page, extracting one or more pieces of key element information from the webpage source code as the summary information.
  4. The method according to claim 3, wherein the single page comprises one or a combination of the following: a download content page, an audio/video playback page, a novel reading page, a question-and-answer page, a news picture-set page, and a special-topic page.
  5. The method according to claim 1, wherein the webpage resource comprises webpage source code, the page type comprises a list page, and the step of extracting, for the page type, corresponding summary information from the webpage resource comprises:
    for the list page, extracting from the webpage source code one or more pieces of element information that rank highest by the click-through rate counted by the webpage resource, as the summary information.
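As a minimal sketch of the selection described in claim 5, assuming the click-through rate of each list element has already been obtained from the webpage resource, the top-ranked elements can be picked with a sort. The `Element` record and the default of three elements are hypothetical choices, not taken from the claim.

```python
from typing import List, NamedTuple


class Element(NamedTuple):
    """One entry of a list page (hypothetical fields)."""
    title: str
    url: str
    click_rate: float


def top_elements_by_click_rate(elements: List[Element], k: int = 3) -> List[Element]:
    """Return the k elements with the highest click-through rate, best first."""
    return sorted(elements, key=lambda e: e.click_rate, reverse=True)[:k]
```

For example, given elements with click-through rates 0.1, 0.7 and 0.4 and `k=2`, the function returns the 0.7 element followed by the 0.4 element.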
  6. The method according to claim 5, wherein the list page comprises an audio/video list page.
  7. The method according to claim 1, wherein the step of extracting, for the page type, corresponding summary information from the webpage resource comprises:
    for the page type, sending a first query request to a website object corresponding to the webpage resource;
    receiving a historical access record that is sent by the website object and corresponds to the first query request, the historical access record being a record that the website object obtains according to cookies information after obtaining the cookies information from the current terminal;
    obtaining, from the historical access record, element information in the webpage resource whose access count is greater than a first threshold, as the summary information.
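The final step of claim 7, which claim 8 shares, amounts to a frequency filter over the historical access record. A minimal sketch, assuming the record has already been received and flattened into a list of visited element URLs (the function and parameter names are hypothetical):

```python
from collections import Counter
from typing import Iterable, List


def frequently_visited(
    history_urls: Iterable[str],
    page_element_urls: List[str],
    first_threshold: int,
) -> List[str]:
    """Keep the page's elements whose access count exceeds the first threshold."""
    visits = Counter(history_urls)
    # Counter returns 0 for URLs that never appear in the history.
    return [url for url in page_element_urls if visits[url] > first_threshold]
```

How the record is requested from the website object or from the browser, and how cookies information is used to build it, is outside the scope of this sketch.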
  8. The method according to claim 1, wherein the step of extracting, for the page type, corresponding summary information from the webpage resource comprises:
    for the page type, sending a second query request to a browser of the current terminal, the second query request comprising a website object identifier of the webpage resource;
    receiving a historical access record, returned by the browser, that relates to the website object identifier in the current terminal, the historical access record being obtained after the browser of the current terminal acquires cookies information related to the website object;
    obtaining, from the historical access record, element information in the webpage resource whose access count is greater than a first threshold, as the summary information.
  9. The method according to claim 7 or 8, further comprising the step of:
    adding a specific tag (TAG) to the summary information.
  10. The method according to claim 1, wherein the step of extracting, for the page type, corresponding summary information from the webpage resource is:
    for the page type, looking up summary information corresponding to the webpage resource in a pre-generated summary database, the summary database storing webpage resources and their corresponding summary information.
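The lookup of claim 10 could be sketched with an in-memory SQLite table standing in for the pre-generated summary database. The table name, column names and helper functions below are assumptions made for illustration; the claim only requires that the database store webpage resources together with their summary information.

```python
import sqlite3
from typing import Iterable, Optional, Tuple


def build_summary_db(rows: Iterable[Tuple[str, str, str]]) -> sqlite3.Connection:
    """Create an in-memory summary database from (url, page_type, summary) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE summaries (url TEXT PRIMARY KEY, page_type TEXT, summary TEXT)"
    )
    conn.executemany("INSERT INTO summaries VALUES (?, ?, ?)", rows)
    return conn


def lookup_summary(conn: sqlite3.Connection, url: str, page_type: str) -> Optional[str]:
    """Return the stored summary for this resource and page type, or None."""
    row = conn.execute(
        "SELECT summary FROM summaries WHERE url = ? AND page_type = ?",
        (url, page_type),
    ).fetchone()
    return row[0] if row else None
```

Precomputing summaries offline moves the extraction cost out of the query path, at the price of keeping the database in step with the live pages.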
  11. The method according to any one of claims 3 to 7, wherein the summary information comprises at least one or a combination of the following: an element URL, an element identifier, an element picture, and element text description information of one or more pieces of element information.
  12. A search engine-based summary information extraction apparatus, comprising:
    a webpage resource obtaining module, adapted to obtain a matching webpage resource based on a search string received in a search engine;
    a page type identification module, adapted to identify a page type of the webpage resource;
    a summary information extraction module, adapted to extract, for the page type, corresponding summary information from the webpage resource;
    an information output module, adapted to output the summary information.
  13. The apparatus according to claim 12, wherein the page type identification module is further adapted to:
    extract a page frame of the webpage resource and calculate a page frame ID;
    if the number of page frames having the same page frame ID is greater than a preset threshold, calculate a page frame pattern;
    match the page frame pattern against page frame patterns in a pre-generated database to identify the page type.
  14. The apparatus according to claim 12 or 13, wherein the webpage resource comprises webpage source code, the page type comprises a single page, and the summary information extraction module is further adapted to:
    for the single page, extract one or more pieces of key element information from the webpage source code as the summary information.
  15. The apparatus according to claim 14, wherein the single page comprises one or a combination of the following: a download content page, an audio/video playback page, a novel reading page, a question-and-answer page, a news picture-set page, and a special-topic page.
  16. The apparatus according to claim 12, wherein the webpage resource comprises webpage source code, the page type comprises a list page, and the summary information extraction module is further adapted to:
    for the list page, extract from the webpage source code one or more pieces of element information that rank highest by the click-through rate counted by the webpage resource, as the summary information.
  17. The apparatus according to claim 16, wherein the list page comprises an audio/video list page.
  18. The apparatus according to claim 12, wherein the summary information extraction module is further adapted to:
    for the page type, send a first query request to a website object corresponding to the webpage resource;
    receive a historical access record that is sent by the website object and corresponds to the first query request, the historical access record being a record that the website object obtains according to cookies information after obtaining the cookies information from the current terminal;
    obtain, from the historical access record, element information in the webpage resource whose access count is greater than a first threshold, as the summary information.
  19. The apparatus according to claim 12, wherein the summary information extraction module is further adapted to:
    for the page type, send a second query request to a browser of the current terminal, the second query request comprising a website object identifier of the webpage resource;
    receive a historical access record, returned by the browser, that relates to the website object identifier in the current terminal, the historical access record being obtained after the browser of the current terminal acquires cookies information related to the website object;
    obtain, from the historical access record, element information in the webpage resource whose access count is greater than a first threshold, as the summary information.
  20. The apparatus according to claim 18 or 19, further comprising:
    a tag adding module, adapted to add a specific tag (TAG) to the summary information.
  21. The apparatus according to claim 12, wherein the summary information extraction module is further adapted to:
    for the page type, look up summary information corresponding to the webpage resource in a pre-generated summary database, the summary database storing webpage resources and their corresponding summary information.
  22. The apparatus according to any one of claims 14 to 18, wherein the summary information comprises at least one or a combination of the following: an element URL, an element identifier, an element picture, and element text description information of one or more pieces of element information.
  23. A search engine, comprising:
    a webpage resource obtaining module, adapted to obtain a matching webpage resource based on a received search string;
    a page type identification module, adapted to identify a page type of the webpage resource;
    a summary information extraction module, adapted to extract, for the page type, corresponding summary information from the webpage resource;
    an information output module, adapted to output the summary information.
  24. A computer program comprising computer-readable code which, when run on a computing device, causes the computing device to perform the search engine-based summary information extraction method according to any one of claims 1 to 11.
  25. A computer-readable medium in which the computer program according to claim 24 is stored.
PCT/CN2015/080676 2014-06-27 2015-06-03 Search engine-based summary information extraction method, apparatus and search engine WO2015196910A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410302674.7 2014-06-27
CN201410302674.7A CN104077388A (en) 2014-06-27 2014-06-27 Summary information extraction method and device based on search engine and search engine

Publications (1)

Publication Number Publication Date
WO2015196910A1 (en) 2015-12-30

Family

ID=51598642

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/080676 WO2015196910A1 (en) 2014-06-27 2015-06-03 Search engine-based summary information extraction method, apparatus and search engine

Country Status (2)

Country Link
CN (1) CN104077388A (en)
WO (1) WO2015196910A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110895568A (en) * 2018-09-13 2020-03-20 阿里巴巴集团控股有限公司 Method and system for processing court trial records
CN114372160A (en) * 2022-01-12 2022-04-19 北京字节跳动网络技术有限公司 Search request processing method and device, computer equipment and storage medium
CN114422309A (en) * 2021-12-03 2022-04-29 中国电子科技集团公司第二十八研究所 Method for analyzing service message transmission effect based on abstract feedback comparison mode

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104077388A (en) * 2014-06-27 2014-10-01 北京奇虎科技有限公司 Summary information extraction method and device based on search engine and search engine
CN104317930B (en) * 2014-10-31 2018-05-25 北京奇虎科技有限公司 The presentation optimization method and device of terminal searching
CN105786841A (en) * 2014-12-22 2016-07-20 北京奇虎科技有限公司 Method and system for generating smart abstract of news webpage
CN105786835A (en) * 2014-12-22 2016-07-20 北京奇虎科技有限公司 Method and system for displaying user-defined abstract of picture webpage in search result
CN105786848A (en) * 2014-12-22 2016-07-20 北京奇虎科技有限公司 Method and system for displaying search intelligent abstract on basis of software downloading requirements
CN105786840A (en) * 2014-12-22 2016-07-20 北京奇虎科技有限公司 Display method and system for structured abstract of music webpage
CN105786849A (en) * 2014-12-22 2016-07-20 北京奇虎科技有限公司 Method and system for generating document web page custom abstract
CN105786847A (en) * 2014-12-22 2016-07-20 北京奇虎科技有限公司 Method and system for displaying structured abstracts of commodity web page in e-commerce website
CN105786836A (en) * 2014-12-22 2016-07-20 北京奇虎科技有限公司 Method and system for generating structured abstract of video webpage
CN105786853A (en) * 2014-12-22 2016-07-20 北京奇虎科技有限公司 Display method and system for smart abstract of forum post
CN105786837A (en) * 2014-12-22 2016-07-20 北京奇虎科技有限公司 Method and system for generating intelligent abstract of novel webpage
CN105786854A (en) * 2014-12-22 2016-07-20 北京奇虎科技有限公司 Method and system for generating video play webpage abstract in search result
CN105808562A (en) * 2014-12-30 2016-07-27 北京奇虎科技有限公司 Method and device for extracting webpage abstract based on weight
CN105808561A (en) * 2014-12-30 2016-07-27 北京奇虎科技有限公司 Method and device for extracting abstract from webpage
CN104699840B (en) * 2015-03-31 2016-10-19 北京奇虎科技有限公司 For providing the method and device of mobile terminal to search result
CN104866592B (en) * 2015-05-29 2018-09-07 百度在线网络技术(北京)有限公司 That makes a summary in search engine shows method and apparatus
CN106055595B (en) * 2016-05-23 2019-10-29 北京金山安全软件有限公司 Method and device for displaying value added service information and electronic equipment
US10503803B2 (en) * 2016-11-23 2019-12-10 Google Llc Animated snippets for search results
CN110020108B (en) * 2017-09-12 2023-04-28 腾讯科技(深圳)有限公司 Network resource recommendation method, device, computer equipment and storage medium
CN108090043B (en) * 2017-11-30 2021-11-23 北京百度网讯科技有限公司 Error correction report processing method and device based on artificial intelligence and readable medium
CN110162617B (en) * 2018-09-29 2022-11-04 腾讯科技(深圳)有限公司 Method, apparatus, language processing engine and medium for extracting summary information
US10938952B2 (en) * 2019-06-13 2021-03-02 Microsoft Technology Licensing, Llc Screen reader summary with popular link(s)
CN110532112B (en) * 2019-08-29 2022-10-04 维沃移动通信有限公司 Object extraction method and mobile terminal
CN110825870B (en) * 2019-10-31 2023-07-14 腾讯科技(深圳)有限公司 Method and device for acquiring document abstract, storage medium and electronic device
CN115130022A (en) * 2022-07-04 2022-09-30 北京字跳网络技术有限公司 Content search method, device, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020078019A1 (en) * 2000-10-02 2002-06-20 Lawton Scott S. Method and system for organizing search results into a single page showing two levels of detail
US20090106203A1 (en) * 2007-10-18 2009-04-23 Zhongmin Shi Method and apparatus for a web search engine generating summary-style search results
CN102591971A (en) * 2011-12-31 2012-07-18 北京百度网讯科技有限公司 Method and device for extracting webpage information
CN103761231A (en) * 2013-10-17 2014-04-30 北京奇虎科技有限公司 Method and device for providing media content information of page by search engine
CN104077388A (en) * 2014-06-27 2014-10-01 北京奇虎科技有限公司 Summary information extraction method and device based on search engine and search engine

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102163229B (en) * 2011-04-13 2013-04-17 北京百度网讯科技有限公司 Method and equipment for generating abstracts of searching results
CN102169501A (en) * 2011-04-26 2011-08-31 北京百度网讯科技有限公司 Method and device for generating abstract based on type information of document corresponding with searching result
CN103136359B (en) * 2013-03-07 2016-01-20 宁波成电泰克电子信息技术发展有限公司 Single document abstraction generating method

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110895568A (en) * 2018-09-13 2020-03-20 阿里巴巴集团控股有限公司 Method and system for processing court trial records
CN110895568B (en) * 2018-09-13 2023-07-21 阿里巴巴集团控股有限公司 Method and system for processing court trial records
CN114422309A (en) * 2021-12-03 2022-04-29 中国电子科技集团公司第二十八研究所 Method for analyzing service message transmission effect based on abstract feedback comparison mode
CN114422309B (en) * 2021-12-03 2023-08-11 中国电子科技集团公司第二十八研究所 Service message transmission effect analysis method based on abstract return comparison mode
CN114372160A (en) * 2022-01-12 2022-04-19 北京字节跳动网络技术有限公司 Search request processing method and device, computer equipment and storage medium
CN114372160B (en) * 2022-01-12 2023-08-15 抖音视界有限公司 Search request processing method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN104077388A (en) 2014-10-01

Similar Documents

Publication Publication Date Title
WO2015196910A1 (en) Search engine-based summary information extraction method, apparatus and search engine
US11720577B2 (en) Contextualizing knowledge panels
US10706100B2 (en) Method of and system for recommending media objects
US10248662B2 (en) Generating descriptive text for images in documents using seed descriptors
US10275485B2 (en) Retrieving context from previous sessions
US9268856B2 (en) System and method for inclusion of interactive elements on a search results page
US10157232B2 (en) Personalizing deep search results using subscription data
US10503803B2 (en) Animated snippets for search results
US10860638B2 (en) System and method for interactive searching of transcripts and associated audio/visual/textual/other data files
KR20110085995A (en) Providing search results
EP2707819A1 (en) Dynamic image display area and image display within web search results
US9449054B1 (en) Methods, systems, and media for providing a media search engine
US20130117716A1 (en) Function Extension for Browsers or Documents
US9916384B2 (en) Related entities
US9043320B2 (en) Enhanced find-in-page functions in a web browser
US10223460B2 (en) Application partial deep link to a corresponding resource
US20110099134A1 (en) Method and System for Agent Based Summarization
JP5386548B2 (en) Soaring word extraction apparatus and method
US20160335314A1 (en) Method of and a system for determining linked objects
JP5950737B2 (en) Information extraction apparatus and program
JP2014142769A (en) Text extraction device, text extraction method and text extraction program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15812097

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15812097

Country of ref document: EP

Kind code of ref document: A1