WO2015196910A1

WO2015196910A1 - Search engine-based summary information extraction method, apparatus and search engine

Info

Publication number: WO2015196910A1
Application number: PCT/CN2015/080676
Authority: WO
Inventors: 董毅; 张前川; 陈营营; 张川
Original assignee: 北京奇虎科技有限公司; 奇智软件（北京）有限公司
Priority date: 2014-06-27
Filing date: 2015-06-03
Publication date: 2015-12-30
Also published as: CN104077388A

Abstract

A search engine-based summary information extraction method, apparatus and search engine. The method comprises: obtaining matched page resources on the basis of a search character string received in the search engine (101); identifying the page type of the page resources (102); extracting, according to the page type, corresponding summary information from the page resources (103); and outputting the summary information (104). This reduces how often a user must click through to the pages behind search results to find the desired information, which improves retrieval speed, reduces the number of interactions with the search engine and increases the data processing speed.

Description

Search engine-based summary information extraction method, device and search engine

Technical field

The present invention relates to the technical field of information retrieval, and in particular, to a search engine-based summary information extraction method, a search engine-based summary information extraction apparatus, and a search engine.

Background technique

In today's era when network information is extremely rich, search engines have become an indispensable tool for users to search for massive resources.

To improve how search results are displayed, a search engine may provide, in addition to the page title and URL, a summary taken from the web page. Currently, the ways in which search engines generate summaries can be described as follows:

One is the static method: independent of the query, some text is extracted from the webpage content in advance, at the preprocessing stage, according to fixed rules, for example the first 512 bytes of the webpage text (corresponding to 256 Chinese characters), or the first sentence of each paragraph joined together. The summary thus formed is stored in the query subsystem, and once the document is selected as matching a query, the summary is read out and returned to the user. This approach is the easiest for the query subsystem and requires no additional processing at query time. Its biggest drawback is that the summary is unrelated to the query.

Users, however, want the summary to show the text that directly corresponds to the query, highlighted, and hope that the sentences related to the text they care about appear in the summary. The dynamic summary method arose for this reason. A dynamic summary is generated when responding to a query: the text surrounding the position of the query word in the document is extracted, and the query word is highlighted when displayed. This is the approach most search engines currently use.

Although a dynamic summary contains the user's query terms, these sentences do not necessarily express the central meaning of the entire web document. In other words, by reading the summary returned by the search engine, the user cannot tell whether the page contains the information he or she is looking for. The user then has to click the search result to check whether the corresponding webpage contains the desired information. These repeated interactions consume bandwidth, and search efficiency is low.

Summary of the invention

In view of the above problems, the present invention is proposed to provide a search engine-based summary information extraction method, a corresponding search engine-based summary information extraction apparatus, and a corresponding search engine that overcome the above problems or at least partially solve or alleviate them.

According to an aspect of the present invention, a search engine-based summary information extraction method is provided, including:

obtaining matching webpage resources based on the search string received in the search engine;

identifying a page type of the webpage resource;

extracting corresponding summary information from the webpage resource for the page type;

outputting the summary information.

According to another aspect of the present invention, a search engine-based summary information extracting apparatus is provided, including:

a webpage resource obtaining module, configured to obtain a matching webpage resource based on a search string received in a search engine;

a page type identification module, configured to identify a page type of the webpage resource;

a summary information extraction module, configured to extract corresponding summary information from the webpage resource for the page type;

An information output module adapted to output the summary information.

According to another aspect of the present invention, a search engine is provided, comprising:

a webpage resource obtaining module, configured to obtain a matching webpage resource based on the received search string;

An information output module adapted to output the summary information.

According to still another aspect of the present invention, a computer program is provided, comprising computer readable code that, when executed on a computing device, causes the computing device to perform the search engine-based summary information extraction method described above.

According to still another aspect of the present invention, a computer readable medium is provided, in which the computer program described above is stored.

The beneficial effects of the invention are:

In the embodiment of the present invention, after receiving the search string input by the user, the search engine searches for all webpage resources containing the search string as matching webpage resources. The summary information output in the search result is obtained by identifying the page type of each webpage resource and then extracting the summary in a manner specific to that page type. The summary information displayed in the search result therefore expresses the central meaning of the entire page document more accurately and is more valuable to the user, who can obtain the desired information from the summary itself. This reduces how often the user has to click through to the pages behind search results to find the required information, thereby improving the retrieval speed, reducing the number of interactions with the search engine, and increasing the data processing rate.

In addition, in the embodiment of the present invention, after the matched webpage resource is obtained, the corresponding cookie information is obtained according to the webpage resource, the user's historical access record is obtained according to the cookie information, and the element information of the webpage resource whose access count in the historical access record is greater than the first threshold is used as summary information. The summary information displayed in the search result is therefore personalized for each user, which improves the user experience and makes the summary more valuable: the user can obtain the desired information from the summary itself. This reduces how often the user has to click through to the pages behind search results to find the required information, thereby improving the retrieval speed, reducing the number of interactions with the search engine, and increasing the data processing rate.

The above description is only an overview of the technical solutions of the present invention. To make the above and other objects, features and advantages of the present invention easier to understand, specific embodiments of the invention are set forth below.

DRAWINGS

Various other advantages and benefits will become apparent to those skilled in the art from reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be construed as limiting. Throughout the drawings, the same reference numerals are used to refer to the same parts. In the drawings:

FIG. 1 is a flow chart showing the steps of Embodiment 1 of a search engine-based summary information extraction method according to an embodiment of the present invention;

FIG. 2 is a flow chart showing the steps of Embodiment 2 of a search engine-based summary information extraction method according to an embodiment of the present invention;

FIG. 2-a is a schematic diagram showing a download body page of Embodiment 2 of a search engine-based summary information extraction method according to an embodiment of the present invention;

FIG. 2-b is a schematic diagram showing a first output result of Embodiment 2 of a search engine-based summary information extraction method according to an embodiment of the present invention;

FIG. 3 is a flow chart showing the steps of Embodiment 3 of a search engine-based summary information extraction method according to an embodiment of the present invention;

FIG. 3-a is a schematic diagram showing a home page of a video website according to Embodiment 3 of a search engine-based summary information extraction method according to an embodiment of the present invention;

FIG. 3-b is a schematic diagram showing a second output result of Embodiment 3 of a search engine-based summary information extraction method according to an embodiment of the present invention;

FIG. 4 is a flow chart showing the steps of Embodiment 4 of a search engine-based summary information extraction method according to an embodiment of the present invention;

FIG. 4-a is a schematic diagram showing a home page of a video website according to Embodiment 4 of a search engine-based summary information extraction method according to an embodiment of the present invention;

FIG. 4-b is a schematic diagram showing a third output result of Embodiment 4 of a search engine-based summary information extraction method according to an embodiment of the present invention;

FIG. 5 is a flow chart showing the steps of Embodiment 5 of a search engine-based summary information extraction method according to an embodiment of the present invention;

FIG. 6 is a block diagram showing the structure of an embodiment of a search engine-based summary information extracting apparatus according to an embodiment of the present invention;

FIG. 7 is a block diagram showing the structure of an embodiment of a search engine according to an embodiment of the present invention;

FIG. 8 schematically shows a block diagram of a computing device for performing the method according to the invention;

FIG. 9 schematically shows a storage unit for holding or carrying program code that implements the method according to the invention.

Specific embodiment

The invention is further described below in conjunction with the drawings and specific embodiments.

Referring to FIG. 1, a flow chart of the steps of Embodiment 1 of a search engine-based summary information extraction method according to an embodiment of the present invention is shown. The embodiment may include the following steps:

Step 101: Obtain a matching webpage resource based on a search string received in a search engine.

Step 102: Identify a page type of the webpage resource.

Step 103: Extract corresponding summary information from the webpage resource for the page type.

Step 104: Output the summary information.
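Steps 101 to 104 can be read as a simple pipeline. The following is only a minimal sketch under stated assumptions, not the patented implementation: the names `Page`, `summarize_results`, `search`, `identify_type` and `extractors` are hypothetical, and the search, type identification and extraction logic are passed in as plain functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Page:
    """Hypothetical container for one matched webpage resource."""
    url: str
    source: str  # raw HTML source code of the page


def summarize_results(
    query: str,
    search: Callable[[str], List[Page]],
    identify_type: Callable[[Page], str],
    extractors: Dict[str, Callable[[Page], str]],
) -> List[dict]:
    """Run steps 101-104 for one query and return one entry per matched page."""
    results = []
    for page in search(query):                    # step 101: matched resources
        page_type = identify_type(page)           # step 102: page type
        # step 103: pick the extractor for this page type (empty summary if none)
        extract = extractors.get(page_type, lambda p: "")
        results.append(
            {"url": page.url, "type": page_type, "summary": extract(page)}
        )
    return results                                # step 104: output
```

Keeping the extractor lookup keyed by page type mirrors the idea that each page type gets its own extraction rule, as the later embodiments describe.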

Referring to FIG. 2, a flow chart of the steps of Embodiment 2 of a search engine-based summary information extraction method according to an embodiment of the present invention is shown. The embodiment may include the following steps:

Step 201: Acquire a matching webpage resource based on a search string received in a search engine, where the webpage resource includes a webpage source code;

The search string (query) is the search information that the user enters in the search engine interface to express the intention of searching for related web resources.

After the search engine receives the search string input by the user, it preprocesses the search string (word segmentation, stop-word removal, typo detection, and the like) and then retrieves from the pre-established index database all web resources containing the search string as matching web resources. A webpage resource may include information such as the webpage text, the webpage URL, the webpage source code constituting the webpage, and the links into and out of the webpage.
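The preprocessing and index lookup described above might look like the following minimal sketch. It rests on several assumptions: whitespace splitting stands in for real word segmentation, `STOP_WORDS` is a hypothetical stop-word list, typo detection is omitted, and the index is modelled as a plain dictionary from term to a set of URLs.

```python
from typing import Dict, List, Set

# Hypothetical stop-word list; a real engine would use a much larger one.
STOP_WORDS = {"the", "a", "of", "for"}


def preprocess(query: str) -> List[str]:
    """Segment the query (here: split on whitespace) and drop stop words."""
    tokens = query.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]


def lookup(index: Dict[str, Set[str]], terms: List[str]) -> Set[str]:
    """Return the URLs whose posting lists contain every query term."""
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)
```

For example, with an index mapping `download` to pages `u1`, `u2` and `player` to pages `u2`, `u3`, the query "Download the Player" would match only `u2`.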

Step 202: Identify a page type of the webpage resource, where the page type includes a single page;

After obtaining the webpage resource, the corresponding page type may be further identified according to the webpage resource. In a preferred embodiment of the present invention, the step 202 may include the following substeps:

Sub-step S11, extracting a page frame of the webpage resource, and calculating a page frame ID;

In a specific implementation, the page frame of the webpage resource may be extracted as follows: the page frame is extracted according to the HTML tags in the source code of the webpage, retaining only the frame-type tags such as frame and table. When extracting, the id, name, and class attributes are kept and the remaining attributes are removed. Alternatively, the body text of the page can be identified by its punctuation and removed, leaving the page frame.

After the page frame is extracted, a hash value can be calculated over the retained tags and attributes with a hash algorithm; the hash value of the page frame is the page frame ID. For example, the frame-type tags such as frame and table, together with their id, name, and class attributes, are hashed, and the result is the page frame ID. Since the same hash function is used, identical page frames yield identical page frame IDs.
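As an illustration of sub-step S11, the sketch below keeps only frame-type tags with their id, name and class attributes and hashes the resulting skeleton. The tag set `FRAME_TAGS` (including `div`) and the choice of SHA-256 are assumptions made for the example; the text only requires some fixed hash function.

```python
import hashlib
from html.parser import HTMLParser

FRAME_TAGS = {"frame", "table", "div"}   # assumed set of frame-type tags
KEPT_ATTRS = ("id", "name", "class")     # attributes retained per the text


class FrameExtractor(HTMLParser):
    """Collects a skeleton of the page made of frame-type tags only."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in FRAME_TAGS:
            attr_map = dict(attrs)
            kept = [f"{k}={attr_map[k]}" for k in KEPT_ATTRS if attr_map.get(k)]
            self.parts.append("<" + " ".join([tag] + kept) + ">")

    def handle_endtag(self, tag):
        if tag in FRAME_TAGS:
            self.parts.append(f"</{tag}>")


def page_frame_id(html: str) -> str:
    """Hash the page skeleton; pages sharing a frame share the same ID."""
    parser = FrameExtractor()
    parser.feed(html)
    parser.close()
    skeleton = "".join(parser.parts)
    return hashlib.sha256(skeleton.encode("utf-8")).hexdigest()
```

Two pages generated from the same template but carrying different text or styling would produce the same ID here, which is the property the following sub-steps rely on.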

Sub-step S12, if the number of page frames of the same page frame ID is greater than a preset threshold, calculating a page frame mode;

In practice, when calculating the page frame mode, the title, time, and webpage body text are calculated separately. The calculation may use a machine learning mechanism, such as a support vector machine (SVM). During learning, the extracted page frames are input into the SVM, that is, the key HTML tags are matched against the page frame. The key HTML tags in page frames with the same ID match completely, so once the number of page frames learned for the same ID reaches the preset threshold, the SVM outputs the page frame mode of the corresponding page frame.
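The threshold condition in sub-step S12 can be sketched as a simple count over page frame IDs. This covers only the gating step, not the SVM learning itself; the function name `frame_ids_over_threshold` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def frame_ids_over_threshold(frame_ids: Iterable[str], threshold: int) -> List[str]:
    """Return the frame IDs seen more than `threshold` times, sorted for stability."""
    counts = Counter(frame_ids)
    return sorted(fid for fid, n in counts.items() if n > threshold)
```

Only the IDs returned here would go on to have a page frame mode calculated.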

Sub-step S13, matching the page frame mode with a page frame mode in a pre-generated database to identify the page type.

The pre-generated database stores page frame modes of known types, together with the weight of each webpage feature in each mode. Weights are accumulated for the page frame under each category, and the category for which the page scores the highest weight is taken as the page type.

The page type in the embodiment of the present invention may include a single page, and/or a list page. The single page is a page with a single page element, and may include one or a combination of the following: a download body page, an audio and video play page, a novel reading page, a question and answer page, a news group map page, and a special page. The list page may include an audio and video list page.
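The weighted matching in sub-step S13 might be sketched as follows. The database contents, the feature names and the weights in `PATTERN_DB` are invented for illustration, and a page is assumed to be represented by the set of features found in its page frame.

```python
from typing import Dict, Iterable, Optional

# Hypothetical pre-generated database: page type -> {feature: weight}.
PATTERN_DB: Dict[str, Dict[str, float]] = {
    "download_page": {"downBtn": 3.0, "toolInfo": 2.0},
    "video_list_page": {"rank": 2.5, "name": 1.0},
}


def identify_page_type(
    features: Iterable[str],
    db: Dict[str, Dict[str, float]] = PATTERN_DB,
) -> Optional[str]:
    """Sum the weights of matching features per type and pick the highest score."""
    present = set(features)
    scores = {
        page_type: sum(w for feat, w in weights.items() if feat in present)
        for page_type, weights in db.items()
    }
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] == 0:
        return None  # no known pattern matched at all
    return best
```

Returning `None` when nothing matches is a design choice of this sketch; the text does not say how unmatched pages are handled.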

Step 203: Extract one or more key element information from the webpage source code as summary information for the single page;

The summary information may include at least one or a combination of the following: an element URL of the one or more element information, an element identifier, an element picture, and an element text description information.

In a specific implementation, if the page type of the webpage resource matching the search string is a single page, one or more pieces of key element information may be extracted according to the content of the HTML tags in the webpage source code. The HTML tags may include <a> tags (which define hyperlinks; the href attribute indicates the link target), <meta> tags (which provide meta-information about the page, such as a description and keywords for search engines, and the update frequency), <span> tags (which group inline elements), <div> tags, <p> tags, <script> tags, <class> tags, and more. For example, for a download body page, the corresponding element information can be obtained as summary information from the following code:

<p class="toolInfo">56.6M|Update date 2014/01/03</p>

<p class="roundIcon"><a href="intro.shtml" target="_blank" class="link" title="Functional Animation Display">Functional Animation Display</a></p>

<a href="http://dldir1.XX.com/XXfile/XX/XX2013/XX2013SP6/9305/XX2013SP6.exe" class="downBtn" title="Download now" onclick="tcssClick&&tcssClick('downXX')">Download now</a>

</div>

Here XX stands for the download object. The corresponding element information (summary information) is: 56.6M | update date 2014/01/03; the download address is: http://dldir1.XX.com/XXfile/XX/XX2013/XX2013SP6/9305/XX2013SP6.exe.
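The extraction in step 203 could be sketched with Python's standard `html.parser`. This is an illustrative assumption rather than the patented extractor: it hard-codes the `toolInfo` and `downBtn` class names from the sample markup, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class DownloadSummaryParser(HTMLParser):
    """Pulls the size/date line and the download link out of a download page."""

    def __init__(self):
        super().__init__()
        self.download_url = ""
        self._info_chunks = []
        self._in_info = False

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        css_class = attr_map.get("class") or ""
        if tag == "p" and css_class == "toolInfo":
            self._in_info = True
        elif tag == "a" and css_class == "downBtn":
            self.download_url = attr_map.get("href") or ""

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_info = False

    def handle_data(self, data):
        if self._in_info:
            self._info_chunks.append(data)

    @property
    def info(self) -> str:
        return "".join(self._info_chunks).strip()


def extract_download_summary(html: str) -> dict:
    """Return the key element information of a download page as a dict."""
    parser = DownloadSummaryParser()
    parser.feed(html)
    parser.close()
    return {"info": parser.info, "download_url": parser.download_url}
```

Applied to markup like the sample above, the function would return the size and update date as `info` and the `href` of the download button as `download_url`.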

Step 204: Output the summary information.

After obtaining the summary information corresponding to the webpage resource, the summary information may be output in the preset position of the corresponding search result when the search result is output.

For example, as shown in FIG. 2-a, the download body page 200 has information such as a download object identifier 210, a download object description 220, download address 1 230, and download address 2 240. The download object identifier may be, for example, "XX software official version". The download object description can include the software size, update time, software language, provider, software license, software rating, application platform, an introduction to the software functions, and other information. On the download body page 200, what the user mainly needs is the download address, so the download address link in the page can be extracted by step 203 and displayed in the summary information of the search result. The user can then obtain the download address directly from the summary information and download the object without entering the page behind the search result to look for the download address. The output summary information is shown in the first output result diagram of FIG. 2-b.

In the embodiment of the present invention, after receiving the search string input by the user, the search engine searches for all webpage resources containing the search string as matching webpage resources and, after identifying the page type of each webpage resource, extracts the corresponding summary information from the source code of webpage resources that are single pages. The summary information displayed in the search result therefore expresses the central meaning of the entire page document more accurately and is more valuable to the user, who can obtain the desired information from the summary itself. This reduces how often the user has to click through to the pages behind search results to find the required information, thereby improving the retrieval speed, reducing the number of interactions with the search engine, and increasing the data processing rate.

Referring to FIG. 3, a flow chart of the steps of Embodiment 3 of a search engine-based summary information extraction method according to an embodiment of the present invention is shown. The embodiment may include the following steps:

Step 301: Acquire a matching webpage resource based on a search string received in a search engine, where the webpage resource includes a webpage source code;

Step 302: Identify a page type of the webpage resource, where the page type includes a list page;

In a preferred embodiment of the invention, the step 302 may comprise the following sub-steps:

Sub-step S21, extracting a page frame of the webpage resource, and calculating a page frame ID;

Sub-step S22, if the number of page frames of the same page frame ID is greater than a preset threshold, calculating a page frame mode;

Sub-step S23, matching the page frame mode with the page frame mode in the pre-generated database to identify the page type.

The page type in the embodiment of the present invention may include a single page, and/or a list page. The list page is a page with many page elements, and may include list pages such as an audio and video home page.

Step 303: For the list page, extract from the webpage source code one or more pieces of element information that rank highest by click rate in the webpage resource, as the summary information.

In a specific implementation, if the page type of the webpage resource matching the search string is a list page, the click rate data counted by the webpage (such as a video leaderboard) may be obtained according to the content of the HTML tags in the source code of the webpage, and one or more top-ranked pieces of element information are then extracted from the click rate data as summary information. The HTML tags may include <a> tags (which define hyperlinks; the href attribute indicates the link target), <meta> tags (which provide meta-information about the page, such as a description and keywords for search engines, and the update frequency), <span> tags (which group inline elements), <div> tags, <p> tags, <script> tags, <class> tags, and more. For example, for the home page of a video site, the corresponding element information can be obtained from the following code as summary information:

<a class="name" target="_blank" href="http://v.youku.com/v_show/id_XNzIxNzc0NTUy.html" data-from="1-1">Sharp XX DVD Edition</a>

</div>

The item ranked first in the summary information is thus "Sharp XX DVD Edition". In practice, each piece of element information may include one or more of the following attributes: element URL, element identifier, element picture, and element text description information. For the above example, the summary information can therefore give the playback URL, name, picture and other information of "Sharp XX DVD Edition".
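A minimal sketch of step 303 follows. It assumes that leaderboard entries are `<a class="name">` anchors, as in the sample markup, and that they appear in the page source in rank order; neither assumption is guaranteed for real sites, and the names `LeaderboardParser` and `top_ranked` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List


class LeaderboardParser(HTMLParser):
    """Collects the URL and name of each <a class="name"> anchor in page order."""

    def __init__(self):
        super().__init__()
        self.items: List[Dict[str, str]] = []
        self._current = None  # the entry being read, if inside a matching anchor

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("class") == "name":
            self._current = {"url": attr_map.get("href") or "", "name": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["name"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["name"] = self._current["name"].strip()
            self.items.append(self._current)
            self._current = None


def top_ranked(html: str, n: int = 2) -> List[Dict[str, str]]:
    """Return the first n leaderboard entries as summary elements."""
    parser = LeaderboardParser()
    parser.feed(html)
    parser.close()
    return parser.items[:n]
```

The value of `n` corresponds to the configurable number of top-ranked elements shown in the summary.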

Step 304: Output the summary information.

It should be noted that, when the summary information is output, the one or more element information may be displayed in the search result in the form of a carousel.

For example, the home page of the video website 300 shown in FIG. 3-a may include a video category list 310, the videos of each video category, a corresponding leaderboard (such as the category 1 leaderboard 320), and other information. The video category list may include TV dramas, movies, variety shows, music, animation, travel, and so on. If category 1 330 is TV dramas, then video A to video F are various TV drama programs, and the category 1 leaderboard may list, in order, video A, video B, video D, video F, and so on. In step 303, the video programs at the top of the leaderboard on the video website 300 are extracted as summary information (for example the first two; the specific number can be set as needed and is not limited by the embodiment of the present invention). As shown in the second output result diagram of FIG. 3-b, video A, video B, and the like displayed in the summary information may include the name, play URL, picture, and/or text description of the corresponding video.

In the embodiment of the present invention, after receiving the search string input by the user, the search engine searches for all webpage resources containing the search string as matching webpage resources and, after identifying the page type of each webpage resource, extracts the corresponding summary information from the source code of webpage resources that are list pages. The summary information displayed in the search result therefore expresses the central meaning of the entire page document more accurately and is more valuable to the user, who can obtain the desired information from the summary itself. This reduces how often the user has to click through to the pages behind search results to find the required information, thereby improving the retrieval speed, reducing the number of interactions with the search engine, and increasing the data processing rate.

Referring to FIG. 4, a flow chart of the steps of Embodiment 4 of a search engine-based summary information extraction method according to an embodiment of the present invention is shown. The embodiment may include the following steps:

Step 401: Acquire a matching webpage resource based on a search string received in a search engine;

Step 402: Identify a page type of the webpage resource.

In a preferred embodiment of the invention, the step 402 may comprise the following sub-steps:

Sub-step S31, extracting a page frame of the webpage resource, and calculating a page frame ID;

Sub-step S32, if the number of page frames of the same page frame ID is greater than a preset threshold, calculating a page frame mode;

Sub-step S33, the page frame mode is matched with the page frame mode in the pre-generated database to identify the page type.

Step 403: Extract corresponding summary information from the webpage resource for the page type.

The embodiment of the present invention may display, in the summary information, the element information related to the user's historical access record for the matched web resource, specifically as follows:

In a preferred embodiment of the invention, step 403 can include the following sub-steps:

Sub-step S41, sending, to the website object corresponding to the webpage resource, a first query request for the page type;

Sub-step S42, receiving a historical access record, corresponding to the first query request, sent by the website object, where the historical access record is the record obtained by the website object according to cookie information after it obtains the cookie information from the current terminal;

Sub-step S43, obtaining, from the historical access record, element information of the webpage resource whose access count is greater than the first threshold, as summary information.

Specifically, if the webpage resource matching the search string belongs to a certain website object, the search engine may issue a first query request to the website object; the first query request informs the website object that a user has made a query. After receiving the first query request, the website object obtains the corresponding cookie information from the current terminal, obtains the current user's historical access record according to the cookie information, and feeds it back to the search engine. From the received historical access record, the search engine obtains the element information of the webpage resource whose access count is greater than the first threshold and uses it as the summary information, thereby providing the user with personalized summary information. The first threshold may be 1 or another integer value, which is not limited in this embodiment of the present invention.

In another preferred embodiment of the invention, step 403 can include the following sub-steps:

Sub-step S51, sending, to the browser of the current terminal, a second query request for the page type, where the second query request includes a website object identifier of the webpage resource;

Sub-step S52, receiving a historical access record related to the website object identifier in the current terminal returned by the browser, where the historical access record is obtained after the browser of the current terminal acquires the cookie information related to the website object;

Sub-step S53: Obtain, from the historical access record, element information of the webpage resource whose access times are greater than a first threshold, as summary information.

Specifically, if the webpage resource matching the search string belongs to a certain website object, the search engine may issue a second query request to the browser of the current terminal, asking the browser to retrieve the cookie information of the user's visits to the website object. After receiving the second query request, the browser of the current terminal obtains the cookie information corresponding to the website object identifier from the current terminal, obtains the current user's historical access record according to the cookie information, and feeds it back to the search engine. From the received historical access record, the search engine obtains the element information of the webpage resource whose access count is greater than the first threshold and uses it as the summary information, thereby providing the user with personalized summary information.
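Whichever way the historical access record is obtained, the filtering in sub-steps S43 and S53 reduces to counting accesses per element and keeping those above the first threshold. The sketch below assumes the record is a flat sequence of element identifiers, which the text does not specify; the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def personalized_elements(history: Iterable[str], first_threshold: int = 1) -> List[str]:
    """Return elements accessed more than `first_threshold` times, most visited first."""
    counts = Counter(history)
    frequent = [(elem, n) for elem, n in counts.items() if n > first_threshold]
    # Sort by descending access count, then by name so ties are deterministic.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return [elem for elem, _ in frequent]
```

With a threshold of 1, an element must have been accessed at least twice to appear in the personalized summary.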

Step 404, adding a specific tag TAG to the summary information;

In the embodiment of the present invention, after the personalized summary information is extracted according to the historical access record of the user, the specific identifier TAG may be added to the personalized summary information, for example, the personalized summary information is marked with a recommendation mark.
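Step 404 can be pictured as attaching a marker to the personalized entries before output. In this sketch the summary is assumed to be a list of dictionaries with a `name` key, and the default tag text `recommended` is a placeholder; both are illustrative choices.

```python
from typing import Dict, Iterable, List


def tag_viewed(
    items: List[Dict[str, str]],
    viewed: Iterable[str],
    tag: str = "recommended",
) -> List[Dict[str, str]]:
    """Return copies of the summary items, adding a tag to those the user has viewed."""
    viewed_set = set(viewed)
    tagged = []
    for item in items:
        entry = dict(item)  # copy so the caller's data is not modified
        if entry.get("name") in viewed_set:
            entry["tag"] = tag
        tagged.append(entry)
    return tagged
```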

Step 405, output the summary information added with the specific tag TAG.

In a specific implementation, the summary information includes at least one or a combination of the following: an element URL of one or more element information, an element identifier, an element picture, and an element text description information.

For example, the home page of the video website 400 shown in FIG. 4-a may include a video category list 410, the videos of each video category, a corresponding leaderboard (such as the category 1 leaderboard 420), and other information. The video category list may include TV dramas, movies, variety shows, music, animation, travel, and so on. If category 1 430 is TV dramas, then video A to video F are various TV drama programs, and the category 1 leaderboard may list, in order, video A, video B, video D, video F, and so on. The user's historical access record for the video website 400 can be obtained by step 403. If the videos that the user has watched on the video website are video E and video F, the videos watched by the user are marked with "excellent" (the specific mark can be set as needed and is not limited by the embodiment of the present invention) and shown in the summary, as in the third output result diagram of FIG. 4-b. Video A, video B, and the like displayed in the summary information may include the name, play URL, picture, and/or text description of the corresponding video.

In the embodiment of the present invention, after receiving the search string input by the user, the search engine searches for all webpage resources containing the search string as matching webpage resources. After identifying the page type of each webpage resource, it obtains, for the different page types, the corresponding cookie information according to the webpage resource, obtains the user's historical access record according to the cookie information, and takes from the historical access record the element information of the webpage resource whose access count is greater than the first threshold as the summary information. The summary information displayed in the search result is therefore personalized for each user and is more valuable, and the user can obtain the desired information from the summary itself. This reduces how often the user has to click through to the pages behind search results to find the required information, thereby improving the retrieval speed, reducing the number of interactions with the search engine, and increasing the data processing rate.

Referring to FIG. 5, a flow chart of the steps of the fifth embodiment of the method for extracting summary information based on the search engine is illustrated. The embodiment of the present invention may include the following steps:

Step 501: Obtain a matching webpage resource based on a search string received in a search engine;

Step 502: Identify a page type of the webpage resource.

In a preferred embodiment of the invention, the step 502 can include the following sub-steps:

Sub-step S61, extracting a page frame of the webpage resource, and calculating a page frame ID;

Sub-step S62, if the number of page frames with the same page frame ID is greater than a preset threshold, calculating a page frame pattern;

Sub-step S63, matching the page frame pattern against the page frame patterns in a pre-generated database to identify the page type.

Step 503: Search for, according to the page type, summary information corresponding to the webpage resource from a pre-generated digest database, where the digest database stores webpage resources and corresponding digest information;

Specifically, in addition to obtaining the summary information of each hit webpage resource in real time as described in the foregoing first to fourth embodiments, the embodiment of the present invention may extract the summary information of each webpage resource in advance, when the spider crawls the webpage, and store it in the digest database. The summary information in the digest database is updated every preset time period, and when a webpage resource is hit, the summary information corresponding to that webpage resource is obtained from the digest database.

Step 504: Output the summary information.

The summary information includes at least one or a combination of the following: an element URL of one or more element information, an element identifier, an element picture, and an element text description information.

In the embodiment of the present invention, after receiving the search string input by the user, the search engine finds all webpage resources that include the search string and treats them as the matching webpage resources, looks up the summary information corresponding to each webpage resource in the pre-generated digest database, and outputs that summary information in the search results, which improves the search speed. The summary information displayed in the search results expresses the central meaning of the entire page document with higher accuracy, so the information provided to the user is more valuable. Because the user can obtain the desired information from the summary information, the user less often has to click through the pages corresponding to the search results to find it, which reduces the number of interactions with the search engine and increases the data processing rate.
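
A pre-generated digest database of this kind can be sketched as below. The sketch rests on assumptions that are not part of the embodiment: the class name `DigestDatabase`, an in-memory dictionary as the store, the `extract` and `fetch` callables, and an injectable `clock` are all hypothetical and introduced only for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class DigestDatabase:
    """Maps a webpage resource URL to summary information extracted in advance."""

    def __init__(
        self,
        extract: Callable[[str], str],
        refresh_seconds: float,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._extract = extract
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def store(self, url: str, page_source: str) -> None:
        """Called when the spider crawls a page: extract and save its summary."""
        self._entries[url] = (self._clock(), self._extract(page_source))

    def refresh(self, fetch: Callable[[str], str]) -> int:
        """Re-extract every entry older than the preset period; return how many."""
        now = self._clock()
        stale = [
            url
            for url, (stored_at, _) in self._entries.items()
            if now - stored_at >= self._refresh_seconds
        ]
        for url in stale:
            self.store(url, fetch(url))
        return len(stale)

    def lookup(self, url: str) -> Optional[str]:
        """Return the stored summary for a hit webpage resource, if any."""
        entry = self._entries.get(url)
        return entry[1] if entry else None


# Assumed example: a fake clock and a trivial "first sentence" extractor.
now = [0.0]
db = DigestDatabase(
    extract=lambda source: source.split(".")[0],
    refresh_seconds=3600,
    clock=lambda: now[0],
)
db.store("http://example.com/p", "Old title. Body text")
before = db.lookup("http://example.com/p")
now[0] = 7200.0  # two hours later, the entry is older than the preset period
refreshed = db.refresh(fetch=lambda url: "New title. Body text")
after = db.lookup("http://example.com/p")
```

At query time only `lookup` is called, so the cost of extraction is paid at crawl or refresh time rather than while the user waits for results.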

For simplicity of description, the method embodiments are expressed as a series of combined actions. It will be appreciated by those skilled in the art that the present invention is not limited by the order of the actions described, as some steps may be performed in other sequences or concurrently in accordance with the present invention. In addition, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.

Referring to FIG. 6, a structural block diagram of an embodiment of a search engine-based summary information extracting apparatus according to an embodiment of the present invention is shown, and the apparatus may include the following modules.

The webpage resource obtaining module 601 is adapted to obtain a matching webpage resource based on the search string received in the search engine;

a page type identification module 602, configured to identify a page type of the webpage resource;

The summary information extraction module 603 is adapted to extract corresponding summary information from the webpage resource for the page type;

The information output module 604 is adapted to output the summary information.
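
The way the four modules hand work to one another can be sketched as a simple pipeline. This is an illustration only: the class `SummaryExtractionApparatus`, its field names, and the stub behaviours used in the example are assumptions made here, and each module is reduced to a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SummaryExtractionApparatus:
    """Hypothetical wiring of modules 601 to 604 as plain callables."""
    obtain_resources: Callable[[str], List[str]]   # module 601
    identify_page_type: Callable[[str], str]       # module 602
    extract_summary: Callable[[str, str], str]     # module 603
    output: Callable[[str], None]                  # module 604

    def handle(self, search_string: str) -> None:
        """Run every matching webpage resource through modules 602 to 604."""
        for resource in self.obtain_resources(search_string):
            page_type = self.identify_page_type(resource)
            self.output(self.extract_summary(resource, page_type))


# Assumed stubs standing in for real retrieval, classification and extraction.
pages = {"cats": ["<ul>cat list</ul>", "<h1>cat video</h1>"]}
shown: List[str] = []
apparatus = SummaryExtractionApparatus(
    obtain_resources=lambda query: pages.get(query, []),
    identify_page_type=lambda src: "list page" if "<ul>" in src else "single page",
    extract_summary=lambda src, page_type: f"{page_type}: {src}",
    output=shown.append,
)
apparatus.handle("cats")
```

Passing the page type into the extraction callable reflects the point of the design: the extraction strategy is chosen per page type rather than being the same for every page.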

In a preferred embodiment of the present invention, the page type identification module 602 is further adapted to:

Extracting a page frame of the webpage resource, and calculating a page frame ID;

If the number of page frames with the same page frame ID is greater than a preset threshold, calculating a page frame pattern;

The page frame pattern is matched with the page frame pattern in the pre-generated database to identify the page type.
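
The three operations above can be sketched as follows. The specification does not say how a page frame, its ID, or the page frame pattern is computed, so everything concrete here is an assumption: the page frame is taken to be the sequence of opening tag names, the page frame ID is a SHA-256 hash of that sequence, the frame ID itself stands in for the page frame pattern, and the pre-generated database is a dictionary. All function names are hypothetical.

```python
import hashlib
from collections import Counter
from html.parser import HTMLParser
from typing import Dict, List, Optional


class _FrameParser(HTMLParser):
    """Collects opening tag names, ignoring text content and attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def page_frame(html: str) -> str:
    """Assumed definition of a page frame: the ordered tag skeleton."""
    parser = _FrameParser()
    parser.feed(html)
    parser.close()
    return ">".join(parser.tags)


def page_frame_id(html: str) -> str:
    """Assumed definition of a page frame ID: a hash of the page frame."""
    return hashlib.sha256(page_frame(html).encode("utf-8")).hexdigest()


def identify_page_type(
    html: str,
    site_pages: List[str],
    known_patterns: Dict[str, str],
    preset_threshold: int,
) -> Optional[str]:
    """Return the page type if the frame is common enough and is known."""
    frame_id = page_frame_id(html)
    counts = Counter(page_frame_id(page) for page in site_pages)
    if counts[frame_id] <= preset_threshold:
        return None  # too few pages share this frame to form a pattern
    return known_patterns.get(frame_id)


# Assumed example pages: two play pages share a frame, the list page does not.
play_a = "<html><body><h1>A</h1><video></video></body></html>"
play_b = "<html><body><h1>B</h1><video></video></body></html>"
list_p = "<html><body><ul><li>x</li></ul></body></html>"
known = {page_frame_id(play_a): "single page"}
site = [play_a, play_b, list_p]
type_of_play_b = identify_page_type(play_b, site, known, preset_threshold=1)
type_of_list_p = identify_page_type(list_p, site, known, preset_threshold=1)
```

Because text content is ignored, pages generated from the same template hash to the same page frame ID even when their titles and bodies differ.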

In a preferred embodiment of the present invention, the webpage resource includes a webpage source code, the page type includes a single page, and the digest information extraction module 603 is further adapted to:

For the single page, one or more key element information is extracted from the webpage source code as summary information.

As a preferred example of the embodiment of the present invention, the single page may include one or a combination of the following: a download body page, an audio and video play page, a novel reading page, a question and answer page, a news group map page, and a feature page.

In a preferred embodiment of the present invention, the webpage resource includes a webpage source code, the page type includes a list page, and the digest information extraction module 603 is further adapted to:

For the list page, extracting from the webpage source code, as the summary information, the one or more pieces of element information that rank highest by the click rate counted by the webpage resource.
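
Ranking by click rate can be sketched as below, assuming the element names and their click rates have already been parsed out of the webpage source code. The function name `top_elements_by_click_rate`, the `top_n` parameter, and the example figures are hypothetical.

```python
from typing import Dict, List


def top_elements_by_click_rate(click_rates: Dict[str, float], top_n: int) -> List[str]:
    """Return the names of the top_n elements, highest click rate first."""
    ranked = sorted(click_rates.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:top_n]]


# Assumed click rates for the elements of an audio and video list page.
rates = {"video A": 0.31, "video B": 0.27, "video C": 0.05, "video D": 0.12}
top_three = top_elements_by_click_rate(rates, top_n=3)
print(top_three)
```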

As a preferred example of an embodiment of the present invention, the list page may include an audio and video list page.

In a preferred embodiment of the present invention, the summary information extraction module 603 is further adapted to:

For the page type, sending a first query request to the website object corresponding to the webpage resource;

Receiving, from the website object, a historical access record returned for the first query request, where the historical access record is obtained by the website object according to cookie information that it acquires from the current terminal;

Obtaining from the historical access record, as the summary information, the element information of the webpage resource whose access count is greater than the first threshold.

In a preferred embodiment of the present invention, the summary information extraction module 603 is further adapted to:

For the page type, sending a second query request to the browser of the current terminal, where the second query request includes a website object identifier of the webpage resource;

Receiving, from the browser, a historical access record in the current terminal related to the website object identifier, where the historical access record is obtained after the browser of the current terminal acquires the cookie information related to the website object;

Obtaining from the historical access record, as the summary information, the element information of the webpage resource whose access count is greater than the first threshold.

In a preferred embodiment of the present invention, the apparatus may further include:

A tag adding module is adapted to add a specific tag TAG to the digest information.

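In a preferred embodiment of the present invention, the summary information extraction module 603 is further adapted to:

For the page type, look up the summary information corresponding to the webpage resource in a pre-generated digest database, where the digest database stores webpage resources and the corresponding summary information.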

As a preferred example of the embodiment of the present invention, the summary information may include at least one or a combination of the following: an element URL of one or more pieces of element information, an element identifier, an element picture, and element text description information.

Referring to FIG. 7, a structural block diagram of an embodiment of a search engine according to an embodiment of the present invention is shown. The search engine may include the following modules.

The webpage resource obtaining module 701 is adapted to obtain a matching webpage resource based on the received search string;

a page type identification module 702, configured to identify a page type of the webpage resource;

The summary information extraction module 703 is adapted to extract corresponding summary information from the webpage resource for the page type;

The information output module 704 is adapted to output the summary information.

In preferred embodiments of the present invention, the page type identification module 702 and the summary information extraction module 703 are further adapted in the same ways as the page type identification module 602 and the summary information extraction module 603 described above, including the cases where the webpage resource includes a webpage source code and the page type includes a single page or a list page.

The embodiments in the present specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments can be referred to one another. Because the device and search engine embodiments are basically similar to the method embodiments, their description is relatively brief, and for the relevant parts reference can be made to the description of the method embodiments.

The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the search engine-based summary information extraction device in accordance with embodiments of the present invention. The invention can also be implemented as a device or device program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein. Such a program implementing the invention may be stored on a computer readable medium or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

For example, FIG. 8 illustrates a computing device, such as a retrieval server, that can implement search engine-based summary information extraction in accordance with the present invention. The computing device conventionally includes a processor 810 and a computer program product or computer readable medium in the form of a memory 820. The memory 820 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), an EPROM, a hard disk, or a ROM. The memory 820 has a storage space 830 for program code 831 for performing any of the method steps described above. For example, the storage space 830 for program code may include individual program codes 831 that each implement one of the steps in the above methods. The program code can be read from or written to one or more computer program products. These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. The storage unit may have storage segments, storage spaces, and the like arranged similarly to the memory 820 in the computing device of FIG. 8. The program code may, for example, be compressed in an appropriate form. Typically, the storage unit includes computer readable code 831', i.e., code readable by a processor such as 810, which, when executed by a computing device, causes the computing device to perform each step of the methods described above.

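References herein to "one embodiment," "an embodiment," or "one or more embodiments" mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. In addition, it is noted that instances of the phrase "in one embodiment" do not necessarily all refer to the same embodiment.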

In the description provided herein, numerous specific details are set forth. However, it is understood that the embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of the description.

It is to be noted that the above-described embodiments illustrate rather than limit the invention, and that those skilled in the art may devise alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps that are not recited in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by the same hardware item. The use of the words first, second, and third does not indicate any order. These words can be interpreted as names.

In addition, it should be noted that the language used in the specification has been selected principally for readability and teaching purposes, and not to delineate or limit the subject matter of the invention. Therefore, many modifications and changes will be apparent to those skilled in the art without departing from the scope of the invention. The disclosure of the present invention is intended to be illustrative, and not restrictive, and the scope of the invention is defined by the appended claims.

Claims

A method for extracting summary information based on a search engine, comprising the steps of:

Obtaining a matching webpage resource based on a search string received in the search engine;

Identifying a page type of the webpage resource;

Extracting corresponding summary information from the webpage resource for the page type;

Outputting the summary information.
The method of claim 1, wherein the step of identifying a page type of the webpage resource comprises:

Extracting a page frame of the webpage resource, and calculating a page frame ID;

If the number of page frames with the same page frame ID is greater than a preset threshold, calculating a page frame pattern;

The page frame pattern is matched with the page frame pattern in the pre-generated database to identify the page type.
The method according to claim 1 or 2, wherein the webpage resource comprises a webpage source code, the page type comprises a single page, and the step of extracting corresponding summary information from the webpage resource for the page type comprises:

For the single page, one or more key element information is extracted from the webpage source code as summary information.
The method according to claim 3, wherein the single page comprises one or a combination of the following: a download body page, an audio and video play page, a novel reading page, a question and answer page, a news group map page, and a feature page.
The method of claim 1, wherein the webpage resource comprises a webpage source code, the page type comprises a list page, and the step of extracting corresponding summary information from the webpage resource for the page type comprises:

For the list page, extracting from the webpage source code, as the summary information, the one or more pieces of element information that rank highest by the click rate counted by the webpage resource.
The method of claim 5, wherein the list page comprises an audio and video list page.
The method of claim 1, wherein the step of extracting corresponding summary information from the webpage resource for the page type comprises:

For the page type, sending a first query request to the website object corresponding to the webpage resource;

Receiving, from the website object, a historical access record returned for the first query request, where the historical access record is obtained by the website object according to cookie information that it acquires from the current terminal;

The element information of the webpage resource whose access count is greater than the first threshold is obtained from the historical access record as summary information.
The method according to claim 1, wherein the step of extracting corresponding summary information from the webpage resource for the page type comprises:

For the page type, sending a second query request to the browser of the current terminal, where the second query request includes a website object identifier of the webpage resource;

Receiving, from the browser, a historical access record in the current terminal related to the website object identifier, where the historical access record is obtained after the browser of the current terminal acquires the cookie information related to the website object;

The element information of the webpage resource whose access count is greater than the first threshold is obtained from the historical access record as summary information.
The method of claim 7 or 8, further comprising the step of:

A specific tag TAG is added to the summary information.
The method according to claim 1, wherein the step of extracting corresponding summary information from the webpage resource for the page type is:

For the page type, the summary information corresponding to the webpage resource is searched from the pre-generated digest database, and the digest database stores the webpage resource and the corresponding digest information.
The method according to any one of claims 3 to 7, wherein the summary information includes at least one or a combination of the following: an element URL of one or more pieces of element information, an element identifier, an element picture, and element text description information.
A summary information extraction device based on a search engine, comprising:

a webpage resource obtaining module, configured to obtain a matching webpage resource based on a search string received in a search engine;

a page type identification module, configured to identify a page type of the webpage resource;

a summary information extraction module, configured to extract corresponding summary information from the webpage resource for the page type;

An information output module adapted to output the summary information.
The device of claim 12, wherein the page type identification module is further adapted to:

Extracting a page frame of the webpage resource, and calculating a page frame ID;

If the number of page frames with the same page frame ID is greater than a preset threshold, calculating a page frame pattern;

The page frame pattern is matched with the page frame pattern in the pre-generated database to identify the page type.
The device according to claim 12 or 13, wherein the webpage resource comprises a webpage source code, the page type comprises a single page, and the digest information extracting module is further adapted to:

For the single page, one or more key element information is extracted from the webpage source code as summary information.
The device according to claim 14, wherein the single page comprises one or a combination of the following: a download body page, an audio and video play page, a novel reading page, a question and answer page, a news group map page, and a feature page.
The device according to claim 12, wherein the webpage resource comprises a webpage source code, the page type comprises a list page, and the digest information extraction module is further adapted to:

For the list page, extracting from the webpage source code, as the summary information, the one or more pieces of element information that rank highest by the click rate counted by the webpage resource.
The apparatus of claim 16, wherein the list page comprises an audio and video list page.
The apparatus according to claim 12, wherein the summary information extraction module is further adapted to:

For the page type, sending a first query request to the website object corresponding to the webpage resource;

Receiving, from the website object, a historical access record returned for the first query request, where the historical access record is obtained by the website object according to cookie information that it acquires from the current terminal;

The element information of the webpage resource whose access count is greater than the first threshold is obtained from the historical access record as summary information.
The device according to claim 12, wherein the summary information extraction module is further adapted to:

For the page type, sending a second query request to the browser of the current terminal, where the second query request includes a website object identifier of the webpage resource;

Receiving, from the browser, a historical access record in the current terminal related to the website object identifier, where the historical access record is obtained after the browser of the current terminal acquires the cookie information related to the website object;

The element information of the webpage resource whose access count is greater than the first threshold is obtained from the historical access record as summary information.
The device according to claim 18 or 19, further comprising:

A tag adding module is adapted to add a specific tag TAG to the digest information.
The device according to claim 12, wherein the summary information extraction module is further adapted to:

For the page type, the summary information corresponding to the webpage resource is searched from the pre-generated digest database, and the digest database stores the webpage resource and the corresponding digest information.
The apparatus according to any one of claims 14 to 18, wherein the summary information includes at least one or a combination of the following: an element URL of one or more pieces of element information, an element identifier, an element picture, and element text description information.
A search engine, comprising:

a webpage resource obtaining module, configured to obtain a matching webpage resource based on the received search string;

a page type identification module, configured to identify a page type of the webpage resource;

a summary information extraction module, configured to extract corresponding summary information from the webpage resource for the page type;

An information output module adapted to output the summary information.
A computer program comprising computer readable code which, when run on a computing device, causes the computing device to perform the search engine-based summary information extraction method according to any one of claims 1-11.
A computer readable medium storing the computer program of claim 24.