WO2020067870A1 - Method and system for providing a content list based on a search query

Info

Publication number: WO2020067870A1
Application number: PCT/MY2019/050061
Authority: WO
Inventors: May Fern KOH; Thiam Ho SECK; Tong Khin Thong
Original assignee: Mimos Berhad
Priority date: 2018-09-28
Filing date: 2019-09-20
Publication date: 2020-04-02

Abstract

The invention provides a method and system for providing a content list based on a search query. The method and system generally require that one or more domains associated with the search query be identified by a data classifier (104), so as to enrich the search query using domain knowledge of the identified domains. Based on the enriched query, contents may be crawled by a data collector (102) from a plurality of websites if they are found to be associated with the enriched query. Later, the crawled contents are classified into various categories based on named entities extracted from the contents. The classified contents are subjected to deduplication, in order to eliminate or remove duplicative contents, before they can be displayed to the user.

Description

METHOD AND SYSTEM FOR PROVIDING A CONTENT LIST BASED ON A SEARCH QUERY

Field of Invention

The invention relates to information retrieval. More particularly, it relates to retrieving and managing information including web contents in response to a search query.

Background of the Invention

Users frequently locate and retrieve content items (such as files, images, videos, etc.) from various websites by performing searches through one or more search engines. In general, the searches are performed based on one or a string of keywords provided by a user to a search engine that returns results listing content items corresponding to the keywords. However, the results returned by the search engine are often unsatisfactory and thus, more time may be required to locate the relevant content items. In addition, the results may also include content items that are identical or nearly identical to each other (i.e. duplicative content items).

To overcome the above shortcomings, there are numerous methods and systems in the art for optimising information retrieval. For example, US Patent No. US 8,214,359 B1 discloses a duplicate detection technique that uses query-relevant information to limit the portion or portions of documents to be compared for similarity. Before comparing two documents for similarity, the content of these documents may be condensed based on the query. In one embodiment, the query-relevant information or text (also referred to as “snippets”) is extracted from the documents and only the extracted snippets, rather than the entire documents, are compared for purposes of determining similarity.

U.S. Pub. No. US 2015/0161267 A1 also discloses methods, systems and apparatus, including computer programs encoded on a computer storage medium, for identifying search results that will be provided in response to a search query received from a user device. Two or more search results may reference at least two different resources that are responsive to the search query. It may be determined that the user device will be served a same set of content in response to user interaction with each search result. A replacement search result may be provided in response to the determination, including a reference to a resource serving the same set of content. In response to receiving the search query, a search page may be presented to the user that includes the replacement search result and does not include at least one of the identified search results.

Although numerous systems and methods have been provided in the art for optimising information retrieval, they are still unable to generate satisfactory results for users. For example, the search results are generated based only on the keywords provided by the users, thereby limiting the results returned by the search engines. Further, the users are required to have knowledge about the domain of their search queries in order to perform more extensive searches. In addition, the systems and methods in the art are rather rigid and cannot readily be adapted. Therefore, there exists a need to provide a method and system for improving information retrieval.

Summary of the Invention

One of the objects of the invention is to provide a system and method for providing a content list based on a search query received through a search engine in an automated configuration, thereby reducing the time spent on searching and retrieving contents. In particular, operations such as crawling, enrichment, categorisation, deduplication, etc. are performed automatically depending on specific information (which may relate to domain information, named entity information or both).

Another object of the invention is to provide a system and method that enables intelligent selection of a deduplication technique based on a set of fuzzy logic rules, thereby enhancing the data deduplication and improving accuracy of the deduplication results.

Still another object of the invention is to provide a system and method that enables dynamic and automated population of contents associated with the user search query in templates selected based on categories of the contents, before they are presented in a data dashboard to the user.

At least one of the preceding objects is met, in whole or in part, by the invention, in which one of the embodiments of the invention describes a system for providing a content list based on a search query received through a search engine, comprising a data collector for enriching the search query using domain knowledge and crawling contents from a plurality of websites based on the enriched query; a data classifier for classifying the crawled contents received from the data collector according to named entities extracted from the crawled contents; a deduplication module for subjecting the classified contents of the same named entity to deduplication, to generate the content list without duplicative contents; wherein the deduplication module is configured to select a deduplication technique from a set of fuzzy logic rules based on the named entity of the classified contents and then apply the selected technique to the classified contents, so that one or more duplicative contents can be detected and removed when detected. Preferably, the system may further comprise a layout module for displaying de-duplicated contents in a template (or templates) selected based on the named entity of the de-duplicated contents.

In some preferred embodiments, the data collector may be configured to enrich the search query by expanding one or a string of input keywords contained in the search query based on domain knowledge.

The data collector may also be further configured to filter the crawled contents based on a keyword white list and a keyword blacklist.

In some preferred embodiments, the data classifier may be further configured to identify one or more domains associated with the search query, before the data collector may enrich the search query.

A further embodiment of the invention is a method for providing a content list based on a search query received from a user through a search engine, comprising the steps of identifying one or more domains associated with the search query; enriching the search query using domain knowledge of the identified domains; crawling contents from a plurality of websites based on the enriched query; classifying the crawled contents according to named entities extracted from the crawled contents; subjecting the classified contents of the same named entity to deduplication, so that the content list can be generated without duplicative contents; wherein the deduplication is performed by selecting a deduplication technique from a set of fuzzy logic rules based on the named entity of the classified contents and subsequently applying the selected technique to the classified contents to detect one or more duplicative contents and remove the duplicative contents when detected.

The method may further comprise a step of filtering the crawled contents based on a keyword white list and a keyword blacklist, before the classifying step.

The method may further comprise a step of displaying the de-duplicated contents in a template selected based on the named entity of the de-duplicated contents.

In some further embodiments, the enriching step may preferably be performed by expanding one or a string of input keywords contained in the search query based on domain knowledge.

One skilled in the art will readily appreciate that the invention is well adapted to carry out the aspects and obtain the ends and advantages mentioned, as well as those inherent therein. The embodiments described herein are not intended as limitations on the scope of the invention.

Brief Description of Drawings

For the purpose of facilitating an understanding of the invention, there is illustrated in the accompanying drawings the preferred embodiments from an inspection of which when considered in connection with the following description, the invention, its construction and operation and many of its advantages would be readily understood and appreciated.

Figure 1 is a general architecture of a system (100) for providing a content list based on a search query received from a user through a search engine, in accordance with one embodiment of the invention.

Figure 2 is a general flow chart of a method for providing a content list based on a search query received from a user through a search engine, in accordance with another embodiment of the invention.

Figure 3 is a general flow chart detailing steps to be performed for enriching a user search query, in accordance with one embodiment of the invention.

Figure 4 is a general flow chart detailing steps to be performed for classifying a plurality of crawled contents, in accordance with one embodiment of the invention.

Figure 5 is a general flow chart detailing steps to be performed for de-duplicating classified contents, according to one embodiment of the invention.

Figure 6 is a general flow chart detailing steps to be performed for presenting de-duplicated contents in a template (or templates), according to one embodiment of the invention.

Detailed Description of the Invention

Hereinafter, the invention shall be described according to the preferred embodiments of the invention and by referring to the accompanying description and drawings. However, it is to be understood that limiting the description to the preferred embodiments of the invention and to the drawings is merely to facilitate discussion of the invention and it is envisioned that those skilled in the art may devise various modifications without departing from the scope of the appended claims. The invention provides a system and method for improving retrieval of information including web contents and articles from various websites based on a search query from a user through a search engine. The system and method can also manage the information or contents retrieved from the websites automatically to remove all possible duplicates.

Figure 1 illustrates a general architecture of a system (100) for providing a content list based on a search query received from a user via a search engine, in accordance with a preferred embodiment of the invention. In certain embodiments, the system (100) may communicate with a plurality of databases and websites, as illustrated in Figure 1. The databases may be domain knowledge database (120) for storing domain knowledge and a set of fuzzy logic rules, template database (160) where various template layouts are stored, etc. Depending on the system configuration, these databases can be integral parts of the system (100) or separate components provided to the system (100). In some embodiments, the system (100) may be provided with a rule database for storing a set of fuzzy logic rules, and the rule database is a different database from the domain knowledge database (120).

The system (100) may preferably be provided with several components, such as a data collector (102), data classifier (104), deduplication module (106) and layout module (108). The system (100) may also be configured in some embodiments to enable data transfer between the data collector (102) and data classifier (104).

In certain embodiments, the data collector (102) may be configured to enrich a search query created by a user through a search engine to form an enriched query. Specifically, the data collector (102) may communicate with the domain knowledge database (120) and then expand one or a string of search terms (referred to as “input keywords”) contained in the user search query to other search terms (referred to as “expanded keywords”) by referring to the domain knowledge associated with the search query. Accordingly, the enriched query may contain the input keywords as well as the expanded keywords. The data collector (102) may also be configured to crawl and extract contents from a plurality of websites in response to the enriched query. To crawl and extract contents from the websites, the data collector (102) may be configured to function as a web crawler in some embodiments, or may instead be provided with a web crawler in some alternative embodiments.

In certain embodiments, the data collector (102) may further be configured to filter the crawled contents based on a keyword white list and a keyword blacklist, both of which may be stored on a keyword database. Further, it should be appreciated that the keyword white list and blacklist may be retrieved based on the enriched query.

It should be appreciated that the domain (or domains) associated with the search query may be identified first, before expanding the keywords in the search query. This is to ensure that the data collector (102) may refer to the appropriate domain knowledge when forming an enriched query. To achieve this objective, the data classifier (104) may preferably be configured to analyse the user search query and then identify one or more domains associated with the query. As mentioned earlier, the user search query may comprise one or a string of input keywords and so, the data classifier (104) is configured to extract these input keywords in order to identify the associated domains. For instance, if the search query comprises an input keyword “MIMOS”, the data classifier (104) may identify that it is associated with the “organisation” domain. Accordingly, if the search query comprises several input keywords, more than one domain may be identified by the data classifier (104). In other words, a single search query may be associated with several domains.

Besides identifying the domains of the search query, the data classifier (104) may also be configured to communicate with the data collector (102) for receiving the crawled contents and then classifying them according to named entities that are extracted from the crawled contents. In particular, the data classifier (104) may perform named entity recognition on each of the crawled contents to extract named entity information (such as person, company, location, etc.) from the content. Based on the named entity information, the data classifier (104) may subsequently classify the crawled contents into a plurality of categories. It should be appreciated that the classified contents may be stored together with respective entity information.

As mentioned in the foregoing, the system (100) may also comprise a deduplication module (106) for subjecting the classified contents received from the data classifier (104) to deduplication, so that the content list can be generated without duplicative contents. In particular, the classified contents of the same category are preferably subjected to the deduplication. More particularly, the deduplication module (106) is configured so that it may communicate with the domain knowledge database (120) where a set of fuzzy logic rules is stored, and select an appropriate deduplication technique based on the category of the classified contents. After a deduplication technique is selected, it may be applied to the classified contents to detect whether there is any duplicative content (referred to as a “duplicate”). The deduplication module (106) may also be configured to eliminate the duplicates when they are detected. In certain embodiments, the deduplication module (106) may further be configured to transfer the deduplication results produced by this module (106) to a database for storage.

In some other embodiments, the system (100) may also comprise a layout module (108) that may be configured to populate and present the content list in a template or templates. More particularly, the layout module (108) may retrieve the de-duplicated contents and then populate these contents into a template or templates selected from the template database (160) based on the named entity information of the contents, before displaying or presenting the content list to the user.

In further embodiments, it may be desired to provide a method for providing a content list based on a search query received from a user through a search engine. The method may be performed in certain embodiments by using the system (100) described in the foregoing.

As illustrated in Figure 2, the method may generally comprise several major stages. A user may create a search query through a search engine at the first stage (202), wherein the search query may comprise one or a string of input keywords. Later, at the second stage (204), the input keywords may be extracted from the user search query and expanded to form an enriched query. The search engine may subsequently crawl and extract contents from a plurality of websites in response to the enriched query. Upon extraction, the contents associated with the enriched query may be classified into a plurality of categories at the third stage (206). At the fourth stage (208), the classified contents of the same category may be subjected to deduplication to eliminate or remove undesirable duplicates if detected. Depending on the category of the classified contents, the de-duplicated contents from the fourth stage (208) may be populated and presented in a template at the fifth stage (210), and displayed to the user at the sixth stage (212).
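By way of illustration only, the order of the stages of Figure 2 may be sketched in Python as a chain of interchangeable stage functions. The function name `provide_content_list` and the five callables passed to it are hypothetical and are not defined in this disclosure; the sketch shows only how data may flow from one stage to the next.

```python
def provide_content_list(query, enrich, crawl, classify, deduplicate, render):
    """Run hypothetical stage functions in the order shown in Figure 2."""
    enriched_query = enrich(query)        # second stage (204): enrichment
    contents = crawl(enriched_query)      # second stage (204): crawling
    by_category = classify(contents)      # third stage (206): category -> items
    content_list = {}
    for category, items in by_category.items():
        unique_items = deduplicate(category, items)               # fourth stage (208)
        content_list[category] = render(category, unique_items)   # fifth stage (210)
    return content_list                   # ready for display, sixth stage (212)


# Usage with trivial stand-in stages (all assumptions, for demonstration only).
result = provide_content_list(
    "MIMOS",
    enrich=lambda q: [q, "research"],
    crawl=lambda keywords: [k + " article" for k in keywords] * 2,
    classify=lambda items: {"name": items},
    deduplicate=lambda category, items: list(dict.fromkeys(items)),
    render=lambda category, items: [category + ": " + i for i in items],
)
```

Passing the stages as arguments keeps the sketch independent of any particular crawler, classifier or template engine.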

In some embodiments, it should be appreciated that before expanding the input keywords of the search query in the second stage (204), it may be essential to identify one or more domains associated with the search query. This is to ensure that appropriate domain knowledge is referred to when expanding the input keywords to form an enriched query.

In particular, after receiving a search query from a user through a search engine from the first stage (202), the search query may be analysed in order to identify one or more domains associated with the query. More particularly, the search query may comprise one or a string of input keywords, and these input keywords may be extracted from the query to identify the associated domains. For instance, if the search query contains an input keyword “MIMOS”, it may be identified that the input keyword is associated with the “organisation” domain. In other further embodiments, the input keyword may be associated with more than one domain.
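As a minimal sketch of the domain identification described above, and assuming that the domain knowledge can be represented as a simple keyword-to-domain lookup, the step may be illustrated in Python as follows. The lexicon `DOMAIN_LEXICON`, its entries other than “MIMOS”, and the function `identify_domains` are hypothetical and serve only as an example.

```python
import re

# Hypothetical lexicon: keyword -> domains. Only the "MIMOS" / "organisation"
# pairing comes from the description; the other entries are assumptions.
DOMAIN_LEXICON = {
    "mimos": {"organisation"},
    "kuala lumpur": {"location"},
    "apple": {"organisation", "food"},
}


def identify_domains(query: str) -> set:
    """Return every domain whose keyword occurs as a whole word in the query."""
    text = query.lower()
    domains = set()
    for keyword, keyword_domains in DOMAIN_LEXICON.items():
        # Word boundaries prevent "apple" from matching inside "pineapple".
        if re.search(r"\b" + re.escape(keyword) + r"\b", text):
            domains |= keyword_domains
    return domains
```

Under these assumptions, a query containing “MIMOS” would yield the “organisation” domain, and a keyword such as “apple” would yield more than one domain.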

Figure 3 is a flow chart illustrating an exemplary second stage (204) in more detail. In particular, after identifying the domains associated with the search query (or the input keywords of the query), the domain information may be used to form an enriched query. More particularly, the search query may be enriched by expanding the input keywords, at Step 2042, based on knowledge of the identified domains. For example, the input keyword “MIMOS” associated with the “organisation” domain may be expanded to “MOSTI”, “research”, “innovation” and other expanded keywords. It should also be appreciated that in some further embodiments, the input and expanded keywords may both be included in the enriched query.
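The expansion at Step 2042 may be sketched, purely as an example, with the domain knowledge held in a nested dictionary. The structure `DOMAIN_KNOWLEDGE` and the function `enrich_query` are hypothetical; only the “MIMOS” expansion terms are taken from the description.

```python
# Hypothetical domain knowledge: domain -> input keyword -> expanded keywords.
DOMAIN_KNOWLEDGE = {
    "organisation": {
        "MIMOS": ["MOSTI", "research", "innovation"],
    },
}


def enrich_query(input_keywords, domains):
    """Return the input keywords followed by their expanded keywords."""
    enriched = list(dict.fromkeys(input_keywords))  # keep order, drop repeats
    for domain in domains:
        expansions = DOMAIN_KNOWLEDGE.get(domain, {})
        for keyword in input_keywords:
            for term in expansions.get(keyword, []):
                if term not in enriched:
                    enriched.append(term)
    return enriched
```

In this sketch the enriched query retains the input keywords and appends the expanded keywords, consistent with the embodiment in which both are included.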

In response to the enriched query, contents may be crawled at Step 2044 from various websites and then extracted at Step 2046. In certain further embodiments, the crawled contents may be filtered based on a keyword white list and a keyword blacklist, where the keyword white list and blacklist may be retrieved or created based on the enriched query.
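The description does not specify how the keyword white list and blacklist are applied. One possible reading, offered here only as an assumption, is that a crawled content is kept when it mentions at least one white-listed keyword and no blacklisted keyword. The function `filter_contents` below is a hypothetical Python sketch of that reading.

```python
def filter_contents(contents, whitelist, blacklist):
    """Keep texts that mention a white-listed keyword and no blacklisted one.

    This matching policy is an assumption; the description leaves it open.
    """
    allowed = [w.lower() for w in whitelist]
    blocked = [b.lower() for b in blacklist]
    kept = []
    for text in contents:
        lowered = text.lower()
        if any(b in lowered for b in blocked):
            continue  # a blacklisted keyword rejects the content outright
        if any(w in lowered for w in allowed):
            kept.append(text)
    return kept
```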

Figure 4 is a flow chart showing an exemplary third stage (206) in more detail. As the domains associated with the search query have been identified in the preceding steps, domain data properties may be extracted from the identified domain based on the domain knowledge, as at Step 2062. Named entity information may also be extracted from each of the crawled contents, as at Step 2064. At Step 2066, the crawled contents may be classified into a plurality of categories based on the named entity information. Examples of the categories in this invention may include but are not limited to name data, numeric string, number and date. It should also be appreciated that in certain further embodiments, Steps 2062 and 2064 (i.e. extraction of the domain data properties and named entity information) may be performed concurrently. It should further be appreciated that the entity information may be stored as “lookup terms” along with the crawled web contents, in some further embodiments.
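Assuming that named entity recognition at Step 2064 has already produced (label, value) pairs for each crawled content, the classification at Step 2066 may be sketched in Python as a grouping by category. The mapping `ENTITY_TO_CATEGORY`, the entity labels and the input layout are hypothetical; the recognition step itself is not shown.

```python
from collections import defaultdict

# Hypothetical mapping from a named-entity label to one of the example
# categories (name, numeric string, number, date).
ENTITY_TO_CATEGORY = {
    "person": "name",
    "company": "name",
    "location": "name",
    "registration_number": "numeric string",
    "amount": "number",
    "date": "date",
}


def classify_contents(contents):
    """Group (content id, entity value) pairs by category.

    Each content is a dict with "id" and "entities", where "entities" is a
    list of (label, value) pairs from a prior recognition step.
    """
    categories = defaultdict(list)
    for item in contents:
        for label, value in item["entities"]:
            category = ENTITY_TO_CATEGORY.get(label)
            if category is not None:  # labels without a category are skipped
                categories[category].append((item["id"], value))
    return dict(categories)
```

Keeping the entity value next to the content identifier corresponds loosely to storing the entity information as “lookup terms” along with the crawled contents.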

Figure 5 is a flow chart illustrating an exemplary fourth stage (208) in more detail, in which the classified contents of the same category may be subjected to deduplication. Different deduplication techniques may also be performed depending on the category of the classified contents. Specifically, at Step 2082, one or more appropriate fuzzy logic rules may be selected from the domain knowledge database (120) depending on the category into which the crawled contents are classified. For example, if the category of the classified contents relates to “name”, fuzzy logic rules with a “Soundex” function may be selected. In another embodiment, if the category of the classified contents relates to “numeric string”, fuzzy logic rules with an “edit distance” function may be selected. In still another embodiment, fuzzy logic rules with a “number distance” function may be selected if the category of the classified contents relates to “number”, or fuzzy logic rules with a “date distance” function if the category of the classified contents relates to “date”. Accordingly, at Step 2084, the selected fuzzy logic rules may be executed and applied to the contents of the same category, in order to detect one or more duplicates. The duplicates may be eliminated or removed upon detection. The de-duplicated contents may then be transferred to a database for storage, as at Step 2086.
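The category-dependent selection at Step 2082 and the application at Step 2084 may be sketched in Python as a lookup table of matching rules. This is a simplified illustration rather than the fuzzy logic rules themselves: each rule is reduced to a yes/no test with a fixed threshold, and the thresholds, the table `RULES` and the function names are all assumptions. The `soundex` function follows the standard American Soundex coding, and `edit_distance` is the Levenshtein distance.

```python
from datetime import date

_CODES = {c: d for letters, d in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                                  ("L", "4"), ("MN", "5"), ("R", "6"))
          for c in letters}


def soundex(name):
    """Standard American Soundex code, e.g. 'Robert' and 'Rupert' -> 'R163'."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result, prev = letters[0], _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


# Hypothetical rule table: category -> test deciding whether two values match.
RULES = {
    "name": lambda a, b: soundex(a) == soundex(b),
    "numeric string": lambda a, b: edit_distance(a, b) <= 1,
    "number": lambda a, b: abs(float(a) - float(b)) <= 0.01,
    "date": lambda a, b: abs((date.fromisoformat(a) - date.fromisoformat(b)).days) <= 1,
}


def deduplicate(values, category):
    """Keep the first of each group of values that the selected rule matches."""
    is_duplicate = RULES[category]  # Step 2082: select the rule by category
    kept = []
    for value in values:            # Step 2084: apply it and drop duplicates
        if not any(is_duplicate(value, k) for k in kept):
            kept.append(value)
    return kept
```

For instance, under these assumed thresholds `deduplicate(["Robert", "Rupert", "Ashcraft"], "name")` keeps “Robert” and “Ashcraft”, because “Robert” and “Rupert” share a Soundex code.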

Figure 6 is a flow chart showing an exemplary fifth stage (210) in more detail. At Step 2102, a template may be selected and retrieved from a template database (160) based on the category into which the contents are classified. Accordingly, the contents of the same category may be extracted and retrieved from the database, at Step 2104. After populating the selected template with the contents, it may be displayed in a dashboard to the user, as at Step 2106.
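The template selection at Step 2102 and the population of the selected template may be sketched in Python using the standard `string.Template` class. The table `TEMPLATES`, the placeholder names `value` and `source`, and the function `render` are hypothetical stand-ins for the template database (160) and the layout logic.

```python
from string import Template

# Hypothetical stand-in for the template database (160): category -> layout.
TEMPLATES = {
    "name": Template("Name: $value (source: $source)"),
    "date": Template("Date: $value (source: $source)"),
}
DEFAULT_TEMPLATE = Template("$value (source: $source)")


def render(category, items):
    """Select a template by category and populate it with each content item."""
    template = TEMPLATES.get(category, DEFAULT_TEMPLATE)
    return [template.substitute(item) for item in items]
```

Each item is assumed to be a dictionary providing the placeholders used by the template; the rendered lines could then be passed to whatever dashboard is used for display.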

In another further embodiment of the invention, the method described in the foregoing may be converted to a series of computer-executable program instructions stored on a non-transitory computer-readable storage medium. When the program instructions are executed by a processing module, they may cause the processing module to perform the steps shown in Figures 2, 3, 4, 5 and 6, thus allowing the search query to be analysed and executed in an automated setting.

The disclosure includes that contained in the appended claims, as well as that of the foregoing description. Although this invention has been described in its preferred form with a degree of particularity, it is understood that the disclosure of the preferred form has been made only by way of example and that numerous changes in the details of construction and the combination and arrangements of parts may be resorted to without departing from the scope of the invention.

Example

An example is provided below to illustrate different aspects and embodiments of the invention. The example is not intended in any way to limit the disclosed invention, which is limited only by the claims.

When a user creates a search query comprising an input keyword “MIMOS” on a search engine, it may be identified by the system (100) that the input keyword belongs to an “organisation” domain.

After identifying the domain, the input keyword may be expanded to include other search terms such as “MOSTI”, “research” and “innovation”, wherein the other search terms may collectively be referred to as “expanded keywords”. Contents associated with the input and expanded keywords may be crawled and extracted from a plurality of websites (such as but not limited to social media, news, articles, announcements, blogs and forums).

The knowledge of the organisation domain may be retrieved from the domain knowledge database (120). The organisation domain data properties may optionally be extracted from the domain knowledge database (120), such as business registration number, date of registration, sector, location, board members, shareholders, financial information, etc.

Named entity information (such as person, company, location, etc.) may concurrently be extracted from the crawled web contents. Based on the named entity information, the crawled contents may be classified into various categories, for instance but not limited to nation, world, education, environment, finance, business, sport, technology, lifestyle, video, opinion, etc. In some embodiments, the named entity information may be stored together with the crawled contents.

Subsequently, the classified contents may be subjected to deduplication. In particular, appropriate fuzzy logic rules may be extracted and executed on the contents of the same category. For instance, an edit distance algorithm may be used to detect duplicates based on the entity information and category. Upon execution of the fuzzy logic rules, the duplicates and outdated information can be eliminated, thus increasing the accuracy of the search results.

Finally, a template may be extracted from the template database (160) and populated with the search results based on the entity information and category.

Claims

1. A method for providing a content list based on a search query received through a search engine, comprising the steps of:

identifying one or more domains associated with the search query;

enriching the search query using domain knowledge of the identified domains;

crawling contents from a plurality of websites based on the enriched query;

classifying the crawled contents according to named entities extracted from the crawled contents; and

subjecting the classified contents of the same named entity to deduplication, so that the content list can be generated without duplicative contents,

characterised in that the deduplication is performed by selecting a deduplication technique from a set of fuzzy logic rules based on the named entity of the classified contents and subsequently applying the selected technique to the classified contents to detect one or more duplicative contents and remove the duplicative contents when detected.

2. The method according to claim 1, wherein the enriching step is performed by expanding one or a string of input keywords contained in the search query based on domain knowledge.

3. The method according to claim 1 further comprising a step of filtering the crawled contents based on a keyword white list and a keyword blacklist, before the classifying step.

4. The method according to claim 1 further comprising a step of displaying the de-duplicated contents in a template or templates selected based on the named entity of the de-duplicated contents.

5. A system (100) for providing a content list based on a search query received through a search engine, comprising:

a data collector (102) for enriching the search query using domain knowledge and crawling contents from a plurality of websites based on the enriched query;

a data classifier (104) for classifying the crawled contents received from the data collector (102) according to named entities extracted from the crawled contents; and

a deduplication module (106) for subjecting the classified contents of the same named entity to deduplication, in order to generate the content list without duplicative contents,

characterised in that the deduplication module (106) is configured to select a deduplication technique from a set of fuzzy logic rules based on the named entity of the classified contents and then apply the selected technique to the classified contents, so that one or more duplicative contents can be detected and removed when detected.

6. The system (100) according to claim 5, wherein the data classifier (104) is further configured to identify one or more domains associated with the search query, so that the data collector (102) can enrich the search query.

7. The system (100) according to claim 5, wherein the data collector (102) is configured to enrich the search query by expanding one or a string of input keywords contained in the search query based on domain knowledge.

8. The system (100) according to claim 5, wherein the data collector (102) is further configured to filter the crawled contents based on a keyword white list and a keyword blacklist.

9. The system (100) according to claim 5 further comprising a layout module (108) for displaying de-duplicated contents in a template selected based on the named entity of the de-duplicated contents.