WO2020067870A1 - Method and system for providing a content list based on a search query

Publication number
WO2020067870A1
Authority
WO
WIPO (PCT)
Prior art keywords
contents
search query
deduplication
crawled
classified
Application number
PCT/MY2019/050061
Other languages
French (fr)
Inventor
May Fern KOH
Thiam Ho SECK
Tong Khin Thong
Original Assignee
Mimos Berhad
Application filed by Mimos Berhad
Publication of WO2020067870A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Definitions

  • Figure 1 illustrates a general architecture of a system (100) for providing a content list based on a search query received from a user via a search engine, in accordance with a preferred embodiment of the invention.
  • the system (100) may communicate with a plurality of databases and websites, as illustrated in Figure 1.
  • the databases may include a domain knowledge database (120) for storing domain knowledge and a set of fuzzy logic rules, a template database (160) where various template layouts are stored, etc.
  • these databases can be integral parts of the system (100) or separate components provided to the system (100).
  • the system (100) may be provided with a rule database for storing a set of fuzzy logic rules, and the rule database is a different database from the domain knowledge database (120).
  • the system (100) may preferably be provided with several components, such as a data collector (102), data classifier (104), deduplication module (106) and layout module (108).
  • the system (100) may also be configured in some embodiments to enable data transfer between the data collector (102) and data classifier (104).
  • the data collector (102) may be configured to enrich a search query created by a user through a search engine to form an enriched query.
  • the data collector (102) may communicate with the domain knowledge database (120) and then expand one or a string of search terms (referred to as “input keywords”) contained in the user search query to other search terms (referred to as “expanded keywords”) by referring to the domain knowledge associated with the search query.
  • the enriched query may contain the input keywords as well as expanded keywords.
  • the data collector (102) may also be configured to crawl and extract contents from a plurality of websites in response to the enriched query. To crawl and extract contents from the websites, it should be appreciated that the data collector (102) may be configured to function as a web crawler in some embodiments, or it may be preferred in some alternative embodiments that the data collector (102) may be provided with a web crawler.
  • the data collector (102) may further be configured to filter the crawled contents based on a keyword white list and a keyword blacklist, both of which may be stored on a keyword database. Further, it should be appreciated that the keyword white list and blacklist may be retrieved based on the enriched query.
  • the domain (or domains) associated with the search query may be identified first, before expanding the keywords in the search query. This is to ensure that the data collector (102) may refer to the appropriate domain knowledge when forming an enriched query.
  • the data classifier (104) may preferably be configured to analyse the user search query and then identify one or more domains associated with the query.
  • the user search query may comprise one or a string of input keywords and so, the data classifier (104) is configured to extract these input keywords in order to identify the associated domains. For instance, if the search query comprises an input keyword “MIMOS”, the data classifier (104) may identify that it is associated with the “organisation” domain. Accordingly, if the search query comprises several input keywords, more than one domain may be identified by the data classifier (104). In other words, a single search query may be associated with several domains.
  • the data classifier (104) may also be configured to communicate with the data collector (102) for receiving the crawled contents and then classifying them according to named entities that are extracted from the crawled contents.
  • the data classifier (104) may perform named entity recognition on each of the crawled contents to extract named entity information (such as person, company, location, etc.) from the content. Based on the named entity information, the data classifier (104) may subsequently classify the crawled contents into a plurality of categories. It should be appreciated that the classified contents may be stored together with respective entity information.
  • the system (100) may also comprise a deduplication module (106) for subjecting the classified contents received from the data classifier (104) to deduplication, so that the content list can be generated without duplicative contents.
  • the classified contents of the same category are preferably subjected to the deduplication.
  • the deduplication module (106) is configured so that it may communicate with the domain knowledge database (120) where a set of fuzzy logic rules is stored, and select an appropriate deduplication technique based on the category of the classified contents. After a deduplication technique is selected, it may be applied to the classified contents to detect whether there is any duplicative content (also referred to as a “duplicate”).
  • the deduplication module (106) may also be configured to eliminate the duplicates when they are detected.
  • the deduplication module (106) may further be configured to transfer the deduplication results produced by this module (106) to a database for storage.
  • the system (100) may also comprise a layout module (108) that may be configured to populate and present the content list in a template or templates.
  • the layout module (108) may retrieve the de-duplicated contents and then populate these contents into a template or templates selected from the template database (160) based on the named entity information of the contents, before displaying or presenting the content list to the user.
  • the method may be performed in certain embodiments by using the system (100) described in the foregoing.
  • the method may generally comprise several major stages.
  • a user may create a search query through a search engine at the first stage (202), wherein the search query may comprise one or a string of input keywords.
  • the input keywords may be extracted from the user search query and expanded to form an enriched query.
  • the search engine may subsequently crawl and extract contents from a plurality of websites in response to the enriched query.
  • the crawled contents may then be classified into a plurality of categories at the third stage (206).
  • the classified contents of the same category may be subjected to deduplication to eliminate or remove undesirable duplicates if detected.
  • the de-duplicated contents from the fourth stage (208) may be populated and presented in a template at the fifth stage (210), and displayed to the user at the sixth stage (212).
  • the search query may be analysed in order to identify one or more domains associated with the query.
  • the search query may comprise one or a string of input keywords, and these input keywords may be extracted from the query to identify the associated domains. For instance, if the search query contains an input keyword “MIMOS”, it may be identified that the input keyword is associated with the “organisation” domain. In other further embodiments, the input keyword may be associated with more than one domain.
  • Figure 3 is a flow chart illustrating an exemplary second stage (204) in more detail.
  • the domain information may be used to form an enriched query.
  • the search query may be enriched by expanding the input keywords, at Step 2042, based on knowledge of the identified domains.
  • the input keyword “MIMOS” associated with the “organisation” domain may be expanded to “MOSTI”, “research”, “innovation” and other expanded keywords.
  • the input and expanded keywords may both be included in the enriched query.
  • contents may be crawled at Step 2044 from various websites and then extracted at Step 2046.
  • the crawled contents may be filtered based on a keyword white list and a keyword blacklist, where the keyword white list and blacklist may be retrieved or created based on the enriched query.
  • Figure 4 is a flow chart showing an exemplary third stage (206) in more detail.
  • domain data properties may be extracted from the identified domain based on the domain knowledge, as at Step 2062.
  • Named entity information may also be extracted from each of the crawled contents, as at Step 2064.
  • the crawled contents may be classified into a plurality of categories based on the named entity information. Examples of the categories in this invention may include but are not limited to name data, numeric string, number and date. It should also be appreciated that in certain further embodiments, Steps 2062 and 2064 (i.e. extraction of the domain data properties and named entity information) may be performed concurrently. It should further be appreciated that the entity information may be stored as “lookup terms” along with the crawled web contents, in some further embodiments.
  • FIG. 5 is a flow chart illustrating an exemplary fourth stage (208) in more detail, in which the classified contents of the same category may be subjected to deduplication. Different deduplication techniques may also be performed depending on the category of the classified contents.
  • an appropriate fuzzy logic rule (or more) may be selected from the domain knowledge database (120) depending on the category into which the crawled contents are classified. For example, if the category of the classified contents relates to “name”, fuzzy logic rules with a “Soundex” function may be selected. In another embodiment, if the category of the classified contents relates to “numeric string”, fuzzy logic rules with an “edit distance” function may be selected.
  • fuzzy logic rules with a “number distance” function may be selected if the category of the classified contents relates to “number”, or fuzzy logic rules with a “date distance” function if the category of the classified contents relates to “date”. Accordingly, at Step 2084, the selected fuzzy logic rules may be executed and applied to the contents of the same category, in order to detect one or more duplicates. The duplicates may be eliminated or removed upon detection. The de-duplicated contents may then be transferred to a database for storage, as at Step 2086.
  • FIG. 6 is a flow chart showing an exemplary fifth stage (210) in more detail.
  • a template may be selected and retrieved from a template database (160) based on the category into which the contents are classified. Accordingly, the contents of the same category may be extracted and retrieved from the database, at Step 2104. After the selected template is populated with the contents, it may be displayed in a dashboard to the user, as at Step 2106.
  • the method described in the foregoing may be converted to a series of computer-executable program instructions stored on a non-transitory computer-readable storage medium.
  • when executed by a processing module, the program instructions may cause the processing module to perform the steps shown in Figures 2, 3, 4, 5 and 6, thus allowing the search query to be analysed and executed in an automated setting.
  • when a user creates a search query comprising an input keyword “MIMOS” on a search engine, it may be identified by the system (100) that the input keyword belongs to an “organisation” domain.
  • the input keyword may be expanded to include other search terms such as “MOSTI”, “research” and “innovation”, wherein the other search terms may collectively be referred to as “expanded keywords”.
  • Contents associated with the input and expanded keywords may be crawled and extracted from a plurality of websites (such as but not limited to social media, news, articles, announcements, blogs and forums).
  • the knowledge of the organisation domain may be retrieved from the domain knowledge database (120).
  • the organisation domain data properties may optionally be extracted from the domain knowledge database (120), such as business registration number, date of registration, sector, location, board members, shareholders, financial information, etc.
  • Named entity information (such as person, company, location, etc.) may concurrently be extracted from the crawled web contents. Based on the named entity information, the crawled contents may be classified into various categories, for instance but not limited to nation, world, education, environment, finance, business, sport, technology, lifestyle, video, opinion, etc. In some embodiments, the named entity information may be stored together with the crawled contents.
  • the classified contents may be subjected to deduplication.
  • appropriate fuzzy logic rules may be extracted and executed on the contents of the same category. For instance, an edit distance algorithm may be used to detect duplicates based on the entity information and category. Upon execution of the fuzzy logic rules, the duplicates and outdated information can be eliminated, thus increasing the accuracy of the search results.
  • a template may be extracted from the template database (160) and populated with the search results based on the entity information and category.
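
The template selection and population described above might be sketched as follows. This is only an illustration under stated assumptions: the `TEMPLATE_DATABASE` mapping, the `DEFAULT_TEMPLATE` fallback, the `render_content_list` helper and the item fields (`category`, `title`, `source`) are hypothetical stand-ins for the template database (160) and the stored contents, and Python's standard `string.Template` is used only for convenience.

```python
from string import Template

# Hypothetical stand-in for the template database (160): one layout per category.
TEMPLATE_DATABASE = {
    "business": Template("[Business] $title ($source)"),
    "technology": Template("[Technology] $title ($source)"),
}
# Assumed fallback layout for categories without a dedicated template.
DEFAULT_TEMPLATE = Template("$title ($source)")


def render_content_list(contents):
    """Populate the template selected for each item's category."""
    lines = []
    for item in contents:
        template = TEMPLATE_DATABASE.get(item["category"], DEFAULT_TEMPLATE)
        lines.append(template.substitute(title=item["title"], source=item["source"]))
    return lines


deduplicated = [
    {"category": "business", "title": "Annual report", "source": "news"},
    {"category": "opinion", "title": "Editorial", "source": "blog"},
]
for line in render_content_list(deduplicated):
    print(line)
```

Keeping the layouts in a lookup keyed by category mirrors the idea that the template, rather than the rendering code, changes with the kind of content.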

Abstract

The invention provides a method and system for providing a content list based on a search query. The method and system generally require that one or more domains associated with the search query be identified by a data classifier (104), so as to enrich the search query using domain knowledge of the identified domains. Based on the enriched query, contents may be crawled by a data collector (102) from a plurality of websites if they are found to be associated with the enriched query. Later, the crawled contents are classified into various categories based on named entities extracted from the contents. The classified contents are subjected to deduplication, in order to eliminate or remove duplicative contents, before they can be displayed to the user.

Description

METHOD AND SYSTEM FOR PROVIDING A CONTENT LIST BASED ON A SEARCH QUERY
Field of Invention
The invention relates to information retrieval. More particularly, it relates to retrieving and managing information including web contents in response to a search query.
Background of the Invention
Users frequently locate and retrieve content items (such as files, images, videos, etc.) from various websites by performing searches through one or more search engines. In general, the searches are performed based on one or a string of keywords provided by a user to a search engine that returns results listing content items corresponding to the keywords. However, the results returned by the search engine are often unsatisfactory and thus, more time may be required to locate the relevant content items. In addition, the results may also include content items that are identical or nearly identical to each other (i.e. duplicative content items).
To overcome the above shortcomings, there are numerous methods and systems in the art for optimising information retrieval. For example, US Patent No. US 8,214,359 B1 discloses a duplicate detection technique that uses query-relevant information to limit the portion or portions of documents to be compared for similarity. Before comparing two documents for similarity, the content of these documents may be condensed based on the query. In one embodiment, the query-relevant information or text (also referred to as “snippets”) is extracted from the documents and only the extracted snippets, rather than the entire documents, are compared for purposes of determining similarity.
U.S. Pub. No. US 2015/0161267 A1 also discloses methods, systems and apparatus, including computer programs encoded on a computer storage medium, for identifying search results that will be provided in response to a search query received from a user device. Two or more search results may reference at least two different resources that are responsive to the search query. It may be determined that the user device will be served a same set of content in response to user interaction with each search result. A replacement search result may be provided in response to the determination, including a reference to a resource serving the same set of content. In response to receiving the search query, a search page may be presented to the user that includes the replacement search result and does not include at least one of the identified search results.
Although numerous systems and methods have been provided in the art for optimising information retrieval, they are still unable to generate satisfactory results for the users. For example, the search results are generated based only on the keywords provided by the users, which limits the results returned by the search engines. Further, the users are also required to have knowledge about the domain of their search queries, so that more extensive searches can be performed. In addition, the systems and methods in the art are rather rigid and cannot readily be changed. Therefore, there exists a need to provide a method and system for improving information retrieval.
Summary of the Invention
One of the objects of the invention is to provide a system and method for providing a content list based on a search query received through a search engine in an automated configuration, thereby reducing the time spent on searching and retrieving contents. In particular, operations such as crawling, enrichment, categorisation, deduplication, etc. are performed automatically depending on specific information (which may relate to domain information, named entity information or both).
Another object of the invention is to provide a system and method that enables intelligent selection of a deduplication technique based on a set of fuzzy logic rules, thereby enhancing the data deduplication and improving accuracy of the deduplication results.
Still another object of the invention is to provide a system and method that enables dynamic and automated population of contents associated with the user search query in templates selected based on categories of the contents, before they are presented in a data dashboard to the user.
At least one of the preceding objects is met, in whole or in part, by the invention, in which one of the embodiments of the invention describes a system for providing a content list based on a search query received through a search engine, comprising a data collector for enriching the search query using domain knowledge and crawling contents from a plurality of websites based on the enriched query; a data classifier for classifying the crawled contents received from the data collector according to named entities extracted from the crawled contents; a deduplication module for subjecting the classified contents of the same named entity to deduplication, to generate the content list without duplicative contents; wherein the deduplication module is configured to select a deduplication technique from a set of fuzzy logic rules based on the named entity of the classified contents and then apply the selected technique to the classified contents, so that one or more duplicative contents can be detected and removed when detected. Preferably, the system may further comprise a layout module for displaying de-duplicated contents in a template (or templates) selected based on the named entity of the de-duplicated contents.
In some preferred embodiments, the data collector may be configured to enrich the search query by expanding one or a string of input keywords contained in the search query based on domain knowledge.
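As a minimal sketch of this keyword expansion, assuming the domain knowledge can be represented as an in-memory mapping: the `DOMAIN_KNOWLEDGE` table and the `enrich_query` helper below are hypothetical and not taken from the specification, while the "MIMOS" expansion terms mirror the example used in this document.

```python
# Hypothetical in-memory stand-in for the domain knowledge database (120).
# Maps a domain to the expansion terms known for each keyword in that domain.
DOMAIN_KNOWLEDGE = {
    "organisation": {
        "MIMOS": ["MOSTI", "research", "innovation"],
    },
}


def enrich_query(input_keywords, domains):
    """Return the input keywords followed by any expanded keywords."""
    enriched = list(input_keywords)
    for keyword in input_keywords:
        for domain in domains:
            for term in DOMAIN_KNOWLEDGE.get(domain, {}).get(keyword, []):
                if term not in enriched:  # keep each keyword only once
                    enriched.append(term)
    return enriched


print(enrich_query(["MIMOS"], ["organisation"]))
```

The enriched query keeps the input keywords first, so a keyword with no known expansion simply passes through unchanged.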
The data collector may also be further configured to filter the crawled contents based on a keyword white list and a keyword blacklist.
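The specification does not state how the two lists are combined, so the following sketch assumes one plausible policy: a crawled item is kept only if it mentions at least one white-list keyword and no blacklist keyword. The `filter_contents` helper, the case-insensitive substring matching and the sample data are all illustrative assumptions.

```python
def filter_contents(contents, white_list, blacklist):
    """Keep contents that mention a white-list keyword and no blacklist keyword."""
    kept = []
    for text in contents:
        lowered = text.lower()
        has_wanted = any(word.lower() in lowered for word in white_list)
        has_blocked = any(word.lower() in lowered for word in blacklist)
        if has_wanted and not has_blocked:
            kept.append(text)
    return kept


# Made-up sample data for illustration only.
crawled = [
    "MIMOS announces new research programme",
    "MIMOS discount voucher giveaway",
    "Weather forecast for the weekend",
]
print(filter_contents(crawled, ["MIMOS", "research"], ["voucher"]))
```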
In some preferred embodiments, the data classifier may be further configured to identify one or more domains associated with the search query, before the data collector may enrich the search query.
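One simple way to picture this domain identification is a keyword lookup, sketched below. The `KEYWORD_DOMAINS` table and the `identify_domains` helper are hypothetical; the "MIMOS" to "organisation" pairing follows the example in this document, whereas the "kuala lumpur" entry and its "location" domain are invented purely to show a query matching several domains.

```python
# Hypothetical keyword-to-domain lookup; a real system would presumably
# consult the domain knowledge database (120) instead.
KEYWORD_DOMAINS = {
    "mimos": {"organisation"},
    "kuala lumpur": {"location"},  # invented entry for illustration
}


def identify_domains(search_query):
    """Return the set of domains associated with keywords found in the query."""
    query = search_query.lower()
    domains = set()
    for keyword, keyword_domains in KEYWORD_DOMAINS.items():
        if keyword in query:
            domains |= keyword_domains
    return domains


print(identify_domains("MIMOS"))
```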
A further embodiment of the invention is a method for providing a content list based on a search query received from a user through a search engine, comprising the steps of identifying one or more domains associated with the search query; enriching the search query using domain knowledge of the identified domains; crawling contents from a plurality of websites based on the enriched query; classifying the crawled contents according to named entities extracted from the crawled contents; subjecting the classified contents of the same named entity to deduplication, so that the content list can be generated without duplicative contents; wherein the deduplication is performed by selecting a deduplication technique from a set of fuzzy logic rules based on the named entity of the classified contents and subsequently applying the selected technique to the classified contents to detect one or more duplicative contents and remove the duplicative contents when detected.
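The category-dependent choice of deduplication technique can be illustrated with the sketch below. It is not the claimed implementation: crisp thresholds stand in for the fuzzy logic rules, whose form is not detailed here, and the `RULES` table, the threshold values and the policy of keeping the first occurrence are assumptions. The four categories and their functions (Soundex, edit distance, number distance, date distance) follow the pairings named in this document.

```python
from datetime import date

# American Soundex digit assigned to each consonant.
SOUNDEX_CODES = {ch: digit
                 for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                                        ("L", "4"), ("MN", "5"), ("R", "6"))
                 for ch in letters}


def soundex(name):
    """Return the four-character Soundex code of a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            result += code
        if ch not in "HW":  # H and W do not separate equal codes
            previous = code
    return (result + "000")[:4]


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def date_distance(a, b):
    """Absolute difference in days between two ISO dates (YYYY-MM-DD)."""
    return abs((date.fromisoformat(a) - date.fromisoformat(b)).days)


# Assumed rule table: category -> predicate deciding whether two values are duplicates.
RULES = {
    "name": lambda a, b: soundex(a) == soundex(b),
    "numeric string": lambda a, b: edit_distance(a, b) <= 1,
    "number": lambda a, b: abs(float(a) - float(b)) <= 0.01,
    "date": lambda a, b: date_distance(a, b) <= 1,
}


def deduplicate(values, category):
    """Keep the first occurrence of each group of duplicates in one category."""
    is_duplicate = RULES[category]
    kept = []
    for value in values:
        if not any(is_duplicate(value, seen) for seen in kept):
            kept.append(value)
    return kept


print(deduplicate(["Robert", "Rupert", "Alice"], "name"))
```

Selecting the predicate from a table keyed by category keeps the detection loop identical for every category, which is one way to read "selecting a deduplication technique" before "applying the selected technique".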
The method may further comprise a step of filtering the crawled contents based on a keyword white list and a keyword blacklist, before the classifying step.
The method may further comprise a step of displaying the de-duplicated contents in a template selected based on the named entity of the de-duplicated contents.
In some further embodiments, the enriching step may preferably be performed by expanding one or a string of input keywords contained in the search query based on domain knowledge.
One skilled in the art will readily appreciate that the invention is well adapted to carry out the aspects and obtain the ends and advantages mentioned, as well as those inherent therein. The embodiments described herein are not intended as limitations on the scope of the invention.
Brief Description of Drawings
For the purpose of facilitating an understanding of the invention, there is illustrated in the accompanying drawings the preferred embodiments from an inspection of which when considered in connection with the following description, the invention, its construction and operation and many of its advantages would be readily understood and appreciated.
Figure 1 is a general architecture of a system (100) for providing a content list based on a search query received from a user through a search engine, in accordance with one embodiment of the invention.
Figure 2 is a general flow chart of a method for providing a content list based on a search query received from a user through a search engine, in accordance with another embodiment of the invention.
Figure 3 is a general flow chart detailing steps to be performed for enriching a user search query, in accordance with one embodiment of the invention.
Figure 4 is a general flow chart detailing steps to be performed for classifying a plurality of crawled contents, in accordance with one embodiment of the invention.
Figure 5 is a general flow chart detailing steps to be performed for de-duplicating classified contents, according to one embodiment of the invention.
Figure 6 is a general flow chart detailing steps to be performed for presenting de-duplicated contents in a template (or templates), according to one embodiment of the invention.
Detailed Description of the Invention
Hereinafter, the invention shall be described according to the preferred embodiments of the invention and by referring to the accompanying description and drawings. However, it is to be understood that limiting the description to the preferred embodiments of the invention and to the drawings is merely to facilitate discussion of the invention and it is envisioned that those skilled in the art may devise various modifications without departing from the scope of the appended claims. The invention provides a system and method for improving retrieval of information including web contents and articles from various websites based on a search query from a user through a search engine. The system and method can also manage the information or contents retrieved from the websites automatically to remove all possible duplicates.
Figure 1 illustrates a general architecture of a system (100) for providing a content list based on a search query received from a user via a search engine, in accordance with a preferred embodiment of the invention. In certain embodiments, the system (100) may communicate with a plurality of databases and websites, as illustrated in Figure 1. The databases may be domain knowledge database (120) for storing domain knowledge and a set of fuzzy logic rules, template database (160) where various template layouts are stored, etc. Depending on the system configuration, these databases can be integral parts of the system (100) or separate components provided to the system (100). In some embodiments, the system (100) may be provided with a rule database for storing a set of fuzzy logic rules, and the rule database is a different database from the domain knowledge database (120).
The system (100) may preferably be provided with several components, such as a data collector (102), data classifier (104), deduplication module (106) and layout module (108). The system (100) may also be configured in some embodiments to enable data transfer between the data collector (102) and data classifier (104).
In certain embodiments, the data collector (102) may be configured to enrich a search query created by a user through a search engine to form an enriched query. Specifically, the data collector (102) may communicate with the domain knowledge database (120) and then expand one or a string of search terms (referred to as “input keywords”) contained in the user search query to other search terms (referred to as “expanded keywords”) by referring to the domain knowledge associated with the search query. Accordingly, the enriched query may contain the input keywords as well as the expanded keywords. The data collector (102) may also be configured to crawl and extract contents from a plurality of websites in response to the enriched query. To crawl and extract contents from the websites, the data collector (102) may be configured to function as a web crawler in some embodiments, or it may be provided with a web crawler in some alternative embodiments.
In certain embodiments, the data collector (102) may further be configured to filter the crawled contents based on a keyword white list and a keyword blacklist, both of which may be stored on a keyword database. Further, it should be appreciated that the keyword white list and blacklist may be retrieved based on the enriched query.
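The filtering step may be sketched as follows. This is a minimal illustration only: the function name `filter_contents`, the sample lists and the keep/drop rule (keep a content when it mentions at least one white-list keyword and no blacklist keyword) are assumptions, since the description states only that both lists are used for filtering.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def filter_contents(contents, white_list, blacklist):
    """Keep contents that mention a white-list keyword and no blacklist keyword.

    Assumption: this keep/drop rule is illustrative and is not taken from
    the description.
    """
    white = {w.lower() for w in white_list}
    black = {b.lower() for b in blacklist}
    kept = []
    for text in contents:
        tokens = tokenize(text)
        if (tokens & white) and not (tokens & black):
            kept.append(text)
    return kept

# Hypothetical crawled contents and keyword lists.
crawled = [
    "MIMOS announces new research lab",
    "Win a free prize, MIMOS lottery",
    "Weather forecast for tomorrow",
]
filtered = filter_contents(crawled, ["mimos", "research"], ["lottery"])
```

With these sample inputs, only the first content passes: the second contains a blacklisted keyword and the third contains no white-listed keyword.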
It should be appreciated that the domain (or domains) associated with the search query may be identified first, before expanding the keywords in the search query. This is to ensure that the data collector (102) may refer to the appropriate domain knowledge when forming an enriched query. To achieve this objective, the data classifier (104) may preferably be configured to analyse the user search query and then identify one or more domains associated with the query. As mentioned earlier, the user search query may comprise one or a string of input keywords, so the data classifier (104) is configured to extract these input keywords in order to identify the associated domains. For instance, if the search query comprises an input keyword “MIMOS”, the data classifier (104) may identify that it is associated with the “organisation” domain. Accordingly, if the search query comprises several input keywords, more than one domain may be identified by the data classifier (104). In other words, a single search query may be associated with several domains.
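Domain identification of this kind may be sketched as a keyword lookup. The table `KEYWORD_DOMAINS` below is a hypothetical stand-in for the domain knowledge database (120), and whole-word matching is an assumed matching rule; the description does not specify how keywords are matched to domains.

```python
import re

# Hypothetical keyword-to-domain table; the real domain knowledge is held
# in the domain knowledge database (120).
KEYWORD_DOMAINS = {
    "mimos": {"organisation"},
    "kuala lumpur": {"location"},
    "apple": {"organisation", "food"},
}

def identify_domains(search_query):
    """Return the set of domains whose known keywords occur in the query."""
    query = search_query.lower()
    domains = set()
    for keyword, keyword_domains in KEYWORD_DOMAINS.items():
        # Match whole words only, so that "apple" does not match "pineapple".
        if re.search(r"\b" + re.escape(keyword) + r"\b", query):
            domains |= keyword_domains
    return domains
```

A query holding several input keywords, such as "MIMOS Kuala Lumpur", yields more than one domain, which matches the case of a single search query associated with several domains.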
Besides identifying the domains of the search query, the data classifier (104) may also be configured to communicate with the data collector (102) for receiving the crawled contents and then classifying them according to named entities that are extracted from the crawled contents. In particular, the data classifier (104) may perform named entity recognition on each of the crawled contents to extract named entity information (such as person, company, location, etc.) from the content. Based on the named entity information, the data classifier (104) may subsequently classify the crawled contents into a plurality of categories. It should be appreciated that the classified contents may be stored together with respective entity information.
As mentioned in the foregoing, the system (100) may also comprise a deduplication module (106) for subjecting the classified contents received from the data classifier (104) to deduplication, so that the content list can be generated without duplicative contents. In particular, the classified contents of the same category are preferably subjected to the deduplication. More particularly, the deduplication module (106) is configured so that it may communicate with the domain knowledge database (120) where a set of fuzzy logic rules is stored, and select an appropriate deduplication technique based on the category of the classified contents. After a deduplication technique is selected, it may be applied to the classified contents to detect whether there is any duplicative content (referred to as a “duplicate”). The deduplication module (106) may also be configured to eliminate the duplicates when they are detected. In certain embodiments, the deduplication module (106) may further be configured to transfer the deduplication results produced by this module (106) to a database for storage.
In some other embodiments, the system (100) may also comprise a layout module (108) that may be configured to populate and present the content list in a template or templates. More particularly, the layout module (108) may retrieve the de-duplicated contents and then populate these contents into a template or templates selected from the template database (160) based on the named entity information of the contents, before displaying or presenting the content list to the user.
In further embodiments, it may be desired to provide a method for providing a content list based on a search query received from a user through a search engine. The method may be performed in certain embodiments by using the system (100) described in the foregoing.
As illustrated in Figure 2, the method may generally comprise several major stages. A user may create a search query through a search engine at the first stage (202), wherein the search query may comprise one or a string of input keywords. Later, at second stage (204), the input keywords may be extracted from the user search query and expanded to form an enriched query. The search engine may subsequently crawl and extract contents from a plurality of websites in response to the enriched query. Upon extraction of the contents associated with the enriched query, they may then be classified into a plurality of categories at third stage (206). At fourth stage (208), the classified contents of the same category may be subjected to deduplication to eliminate or remove undesirable duplicates if detected. Depending on the category of the classified contents, the de-duplicated contents from the fourth stage (208) may be populated and presented in a template at fifth stage (210), and displayed to the user at sixth stage (212).
In some embodiments, it should be appreciated that before expanding the input keywords of the search query in the second stage (204), it may be essential to identify one or more domains associated with the search query. This is to ensure that appropriate domain knowledge is referred to when expanding the input keywords to form an enriched query.
In particular, after receiving a search query from a user through a search engine from the first stage (202), the search query may be analysed in order to identify one or more domains associated with the query. More particularly, the search query may comprise one or a string of input keywords, and these input keywords may be extracted from the query to identify the associated domains. For instance, if the search query contains an input keyword “MIMOS”, it may be identified that the input keyword is associated with the “organisation” domain. In other further embodiments, the input keyword may be associated with more than one domain.
Figure 3 is a flow chart illustrating an exemplary second stage (204) in more detail. In particular, after identifying the domains associated with the search query (or the input keywords of the query), the domain information may be used to form an enriched query. More particularly, the search query may be enriched by expanding the input keywords, at Step 2042, based on knowledge of the identified domains. For example, the input keyword “MIMOS” associated with the “organisation” domain may be expanded to “MOSTI”, “research”, “innovation” and other expanded keywords. It should also be appreciated that in some further embodiments, the input and expanded keywords may both be included in the enriched query.
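The expansion at Step 2042 may be sketched as below, using the “MIMOS” example from the description. The dictionary `DOMAIN_KNOWLEDGE` and the function `enrich_query` are hypothetical names; the actual structure of the domain knowledge database (120) is not specified.

```python
# Hypothetical domain knowledge: per domain, input keyword -> expanded keywords.
DOMAIN_KNOWLEDGE = {
    "organisation": {"mimos": ["MOSTI", "research", "innovation"]},
}

def enrich_query(input_keywords, domains):
    """Return the input keywords followed by their expanded keywords.

    Expanded keywords that are already present (case-insensitively) are
    not added a second time.
    """
    enriched = list(input_keywords)
    seen = {keyword.lower() for keyword in enriched}
    for keyword in input_keywords:
        for domain in domains:
            expansions = DOMAIN_KNOWLEDGE.get(domain, {}).get(keyword.lower(), [])
            for expanded in expansions:
                if expanded.lower() not in seen:
                    seen.add(expanded.lower())
                    enriched.append(expanded)
    return enriched

enriched_query = enrich_query(["MIMOS"], ["organisation"])
```

The enriched query keeps the input keyword first and appends the expanded keywords, so that both are included when contents are crawled.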
In response to the enriched query, contents may be crawled at Step 2044 from various websites and then extracted at Step 2046. In certain further embodiments, the crawled contents may be filtered based on a keyword white list and a keyword blacklist, where the keyword white list and blacklist may be retrieved or created based on the enriched query.
Figure 4 is a flow chart showing an exemplary third stage (206) in more detail. As the domains associated with the search query have been identified in the preceding steps, domain data properties may be extracted from the identified domain based on the domain knowledge, as at Step 2062. Named entity information may also be extracted from each of the crawled contents, as at Step 2064. At Step 2066, the crawled contents may be classified into a plurality of categories based on the named entity information. Examples of the categories in this invention may include but are not limited to name data, numeric string, number and date. It should also be appreciated that in certain further embodiments, Steps 2062 and 2064 (i.e. extraction of the domain data properties and named entity information) may be performed concurrently. It should further be appreciated that the entity information may be stored as “lookup terms” along with the crawled web contents, in some further embodiments.
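The grouping at Step 2066 may be sketched as follows, assuming the named entity values have already been extracted by a recognition component that is not shown. The pattern-based rules in `categorise`, including the restriction to ISO-formatted dates, are illustrative assumptions and are not the classification method of the invention.

```python
import re
from collections import defaultdict
from datetime import datetime

def categorise(value):
    """Assign one extracted value to 'date', 'number', 'numeric string' or 'name'."""
    text = value.strip()
    try:
        datetime.strptime(text, "%Y-%m-%d")  # assumption: ISO dates only
        return "date"
    except ValueError:
        pass
    if re.fullmatch(r"[+-]?\d+(\.\d+)?", text):
        return "number"
    # Digits mixed with separators, e.g. registration or telephone numbers.
    if re.fullmatch(r"[\d\s\-/]+", text) and any(c.isdigit() for c in text):
        return "numeric string"
    return "name"

def classify(values):
    """Group the extracted values by category."""
    groups = defaultdict(list)
    for value in values:
        groups[categorise(value)].append(value)
    return dict(groups)

# Hypothetical extracted values.
classified = classify(["MIMOS Berhad", "2019-09-20", "42.5", "03-8995 5000"])
```

Each resulting group can then be handed to the deduplication stage separately, since deduplication is applied to contents of the same category.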
Figure 5 is a flow chart illustrating an exemplary fourth stage (208) in more detail, in which the classified contents of the same category may be subjected to deduplication. Different deduplication techniques may also be performed depending on the category of the classified contents. Specifically, at Step 2082, one or more appropriate fuzzy logic rules may be selected from the domain knowledge database (120) depending on the category into which the crawled contents are classified. For example, if the category of the classified contents relates to “name”, fuzzy logic rules with a “Soundex” function may be selected. In another embodiment, if the category of the classified contents relates to “numeric string”, fuzzy logic rules with an “edit distance” function may be selected. In still another embodiment, fuzzy logic rules with a “number distance” function may be selected if the category of the classified contents relates to “number”, or fuzzy logic rules with a “date distance” function if the category of the classified contents relates to “date”. Accordingly, at Step 2084, the selected fuzzy logic rules may be executed and applied to the contents of the same category, in order to detect one or more duplicates. The duplicates may be eliminated or removed upon detection. The de-duplicated contents may then be transferred to a database for storage, as at Step 2086.
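The category-dependent selection at Steps 2082 and 2084 may be sketched as a lookup table of comparison functions. Note the assumptions: crisp thresholds (`max_edits`, `tolerance`, `max_days`) stand in for genuine fuzzy logic rules, the `RULES` table stands in for the rules stored in the domain knowledge database (120), and dates are assumed to be ISO-formatted strings. The Soundex and edit distance (Levenshtein) functions follow their standard definitions.

```python
from datetime import date

# Standard American Soundex digit codes.
CODES = {c: d for d, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                                 "4": "l", "5": "mn", "6": "r"}.items()
         for c in letters}

def soundex(name):
    """Return the four-character Soundex code of a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "hw":  # 'h' and 'w' do not separate equal codes
            prev = digit
    return (code + "000")[:4]

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def same_name(a, b):
    return soundex(a) == soundex(b)

def same_numeric_string(a, b, max_edits=1):  # assumed threshold
    return edit_distance(a, b) <= max_edits

def same_number(a, b, tolerance=0.0):  # assumed threshold
    return abs(float(a) - float(b)) <= tolerance

def same_date(a, b, max_days=0):  # assumed threshold
    return abs((date.fromisoformat(a) - date.fromisoformat(b)).days) <= max_days

# Hypothetical rule table: category -> duplicate test.
RULES = {"name": same_name, "numeric string": same_numeric_string,
         "number": same_number, "date": same_date}

def deduplicate(values, category):
    """Keep the first of each group of values that the selected rule treats as duplicates."""
    is_duplicate = RULES[category]
    unique = []
    for value in values:
        if not any(is_duplicate(value, kept) for kept in unique):
            unique.append(value)
    return unique
```

For instance, `deduplicate(["Robert", "Rupert", "Ashcraft"], "name")` keeps "Robert" and "Ashcraft", because "Robert" and "Rupert" share the Soundex code R163.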
Figure 6 is a flow chart showing an exemplary fifth stage (210) in more detail. At Step 2102, a template may be selected and retrieved from a template database (160) based on the category into which the contents are classified. Accordingly, the contents of the same category may be extracted and retrieved from the database, at Step 2104. After populating the selected template with the contents, it may be displayed in a dashboard to the user, as at Step 2106.
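Template selection and population at Steps 2102 to 2106 may be sketched with the Python standard library class `string.Template`. The `TEMPLATES` dictionary is a hypothetical stand-in for the template database (160), and the fallback layout for unknown categories is an assumption.

```python
from string import Template

# Hypothetical stand-in for the template database (160): category -> layout.
TEMPLATES = {
    "name": Template("Name: $value"),
    "date": Template("Date: $value"),
}
# Assumed fallback for categories without a dedicated template.
DEFAULT_TEMPLATE = Template("$category: $value")

def render(category, contents):
    """Select the template for the category and populate it with each content."""
    template = TEMPLATES.get(category, DEFAULT_TEMPLATE)
    return [template.substitute(category=category, value=c) for c in contents]

rows = render("name", ["MIMOS Berhad"])
```

The rendered rows would then be passed to whatever dashboard component displays the content list to the user.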
In another further embodiment of the invention, the method described in the foregoing may be converted to a series of computer-executable program instructions stored on a non-transitory computer-readable storage medium. When the program instructions are executed by a processing module, they may cause the processing module to perform the steps shown in Figures 2, 3, 4, 5 and 6, thus allowing the search query to be analysed and executed in an automated setting.
The disclosure includes that contained in the appended claims, as well as that of the foregoing description. Although this invention has been described in its preferred form with a degree of particularity, it is understood that the disclosure of the preferred form has been made only by way of example and that numerous changes in the details of construction and the combination and arrangements of parts may be resorted to without departing from the scope of the invention.
Example
An example is provided below to illustrate different aspects and embodiments of the invention. The example is not intended in any way to limit the disclosed invention, which is limited only by the claims.
When a user creates a search query comprising an input keyword “MIMOS” on a search engine, it may be identified by the system (100) that the input keyword belongs to an “organisation” domain.
After identifying the domain, the input keyword may be expanded to include other search terms such as “MOSTI”, “research” and “innovation”, wherein the other search terms may collectively be referred to as “expanded keywords”. Contents associated with the input and expanded keywords may be crawled and extracted from a plurality of websites (such as but not limited to social media, news, articles, announcements, blogs and forums).
The knowledge of the organisation domain may be retrieved from the domain knowledge database (120). The organisation domain data properties may optionally be extracted from the domain knowledge database (120), such as business registration number, date of registration, sector, location, board members, shareholders, financial information, etc.
Named entity information (such as person, company, location, etc.) may concurrently be extracted from the crawled web contents. Based on the named entity information, the crawled contents may be classified into various categories, for instance but not limited to nation, world, education, environment, finance, business, sport, technology, lifestyle, video, opinion, etc. In some embodiments, the named entity information may be stored together with the crawled contents.
Subsequently, the classified contents may be subjected to deduplication. In particular, appropriate fuzzy logic rules may be extracted and executed on the contents of the same category. For instance, an edit distance algorithm may be used to detect duplicates based on the entity information and category. Upon execution of the fuzzy logic rules, the duplicates and outdated information can be eliminated, thus increasing the accuracy of the search results.
Finally, a template may be extracted from the template database (160) and populated with the search results based on the entity information and category.

Claims

1. A method for providing a content list based on a search query received through a search engine, comprising the steps of:
identifying one or more domains associated with the search query;
enriching the search query using domain knowledge of the identified domains;
crawling contents from a plurality of websites based on the enriched query;
classifying the crawled contents according to named entities extracted from the crawled contents; and
subjecting the classified contents of the same named entity to deduplication, so that the content list can be generated without duplicative contents,
characterised in that the deduplication is performed by selecting a deduplication technique from a set of fuzzy logic rules based on the named entity of the classified contents and subsequently applying the selected technique to the classified contents to detect one or more duplicative contents and remove the duplicative contents when detected.
2. The method according to claim 1, wherein the enriching step is performed by expanding one or a string of input keywords contained in the search query based on domain knowledge.
3. The method according to claim 1 further comprising a step of filtering the crawled contents based on a keyword white list and a keyword blacklist, before the classifying step.
4. The method according to claim 1 further comprising a step of displaying the de-duplicated contents in a template or templates selected based on the named entity of the de-duplicated contents.
5. A system (100) for providing a content list based on a search query received through a search engine, comprising:
a data collector (102) for enriching the search query using domain knowledge and crawling contents from a plurality of websites based on the enriched query;
a data classifier (104) for classifying the crawled contents received from the data collector (102) according to named entities extracted from the crawled contents; and
a deduplication module (106) for subjecting the classified contents of the same named entity to deduplication, in order to generate the content list without duplicative contents,
characterised in that the deduplication module (106) is configured to select a deduplication technique from a set of fuzzy logic rules based on the named entity of the classified contents and then apply the selected technique to the classified contents, so that one or more duplicative contents can be detected and removed when detected.
6. The system (100) according to claim 5, wherein the data classifier (104) is further configured to identify one or more domains associated with the search query, so that the data collector (102) can enrich the search query.
7. The system (100) according to claim 5, wherein the data collector (102) is configured to enrich the search query by expanding one or a string of input keywords contained in the search query based on domain knowledge.
8. The system (100) according to claim 5, wherein the data collector (102) is further configured to filter the crawled contents based on a keyword white list and a keyword blacklist.
9. The system (100) according to claim 5 further comprising a layout module (108) for displaying de-duplicated contents in a template selected based on the named entity of the de-duplicated contents.
PCT/MY2019/050061 2018-09-28 2019-09-20 Method and system for providing a content list based on a search query WO2020067870A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
MYPI2018001658 2018-09-28

Publications (1)

Publication Number Publication Date
WO2020067870A1 (en) 2020-04-02

Family

ID=69949497

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2019/050061 WO2020067870A1 (en) 2018-09-28 2019-09-20 Method and system for providing a content list based on a search query

Country Status (1)

Country Link
WO (1) WO2020067870A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020099723A1 (en) * 2000-01-14 2002-07-25 Jorge Garcia-Chiesa Apparatus and method to support management of uniform resource locators and/or contents of database servers
US20080294616A1 (en) * 2007-05-21 2008-11-27 Data Trace Information Services, Llc System and method for database searching using fuzzy rules
US9569527B2 (en) * 2007-06-22 2017-02-14 Google Inc. Machine translation for query expansion
US20170220694A1 (en) * 2013-01-31 2017-08-03 Google Inc. Canonicalized online document sitelink generation
US20180067940A1 (en) * 2016-09-06 2018-03-08 Kakao Corp. Search method and apparatus

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112231554A (en) * 2020-10-10 2021-01-15 腾讯科技(深圳)有限公司 Search recommendation word generation method and device, storage medium and computer equipment
CN112231554B (en) * 2020-10-10 2023-10-31 腾讯科技(深圳)有限公司 Search recommended word generation method and device, storage medium and computer equipment

Similar Documents

Publication Publication Date Title
US10445359B2 (en) Method and system for classifying media content
US9323738B2 (en) Classification of ambiguous geographic references
US7523099B1 (en) Category suggestions relating to a search
CN103631948B (en) Identifying method of named entities
US9619571B2 (en) Method for searching related entities through entity co-occurrence
US20060161545A1 (en) Method and apparatus for ordering items within datasets
US8788503B1 (en) Content identification
MX2013005056A (en) Multi-modal approach to search query input.
US20060288038A1 (en) Generation of a blended classification model
JP2000276484A5 (en) Image search device, image search method
US8423885B1 (en) Updating search engine document index based on calculated age of changed portions in a document
US10140297B2 (en) Supplementing search results with information of interest
US20080059432A1 (en) System and method for database indexing, searching and data retrieval
US20100138414A1 (en) Methods and systems for associative search
WO2020067870A1 (en) Method and system for providing a content list based on a search query
US10671810B2 (en) Citation explanations
JPH09223150A (en) Information classification processing method
Freire et al. Identification of FRBR works within bibliographic databases: An experiment with UNIMARC and duplicate detection techniques
Asadi et al. Pattern-based extraction of addresses from web page content
US20150046437A1 (en) Search Method
JP2002183195A (en) Concept retrieving system
AU2021100441A4 (en) A method of text mining in ranking of web pages using machine learning
JP5346045B2 (en) Document search apparatus, document search method, and document search program
Alqaraleh et al. Utilizing Query by Example for Fast and Accurate Multimedia Retrieval
Shinde et al. Retrieval of efficiently classified, re-ranked images using histogram based score computation algorithm extended with the elimination of duplicate images

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19867159

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19867159

Country of ref document: EP

Kind code of ref document: A1