WO2015076662A1 - A system and method for predicting query in a search engine - Google Patents

A system and method for predicting query in a search engine

Publication number
WO2015076662A1
Authority
WO
WIPO (PCT)
Prior art keywords
query
keywords
based
classifications
historical data
Prior art date
Application number
PCT/MY2014/000179
Other languages
French (fr)
Inventor
Bin Mat Nor FAZLI
Ysrin Bin Amruddin AMRU
Bin Mohammad Ali MOHAMMAD AZAM
Original Assignee
Mimos Berhad
Priority date
2013-11-20
Filing date
2014-06-12
Publication date
2015-05-28
Priority to MYPI2013702212
Application filed by Mimos Berhad
Publication of WO2015076662A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3322 Query formulation using system suggestions

Abstract

The present invention relates to a system (100) and method for predicting query in a search engine based on the similarity of an input query with historical data of keywords and classifications. The system (100) comprises a client server (101), a query server (102) and a database (103). The query server (102) further comprises a query manager (104), a historical data generator (105) and a data mining engine (106). The system (100) provides predicted queries or keywords based on the user's past experience, which is trackable by the data mining engine (106). Thus, the system (100) is capable of improving search efficiency based on its predicted queries.

Description

A SYSTEM AND METHOD FOR PREDICTING QUERY IN A SEARCH ENGINE

FIELD OF INVENTION

The present invention relates to a system and method for predicting query in a search engine. More particularly, the present invention relates to a system and method for predicting query in a search engine based on a similarity of an input query with historical data of keywords and classifications.

BACKGROUND OF THE INVENTION

A query search is usually processed based on the keywords entered by users. However, sometimes it is also processed by using predicted keywords or queries which are based on similarity, semantics or popularity. These predicted keywords and queries give users a better understanding of the information. For a predicted query, users usually have to select only one query from the predicted list, and a search engine then processes the query before returning a set of results for the query.

An example of the above-mentioned search engine is disclosed in United States Patent Publication No. 2007/0239703 A1, which relates to a system and method for generating forecasts of keyword search by providing one or more seasonal categories. After determining a category for the keywords, the system generates a forecast of a keyword search volume for one or more keywords having a seasonal correlation value greater than or equal to a predetermined threshold. Another United States Patent Publication No. 2004/0049499 also discloses a search engine, wherein the system extracts keywords from a query and similarly classifies the keywords into a classification type for the system to retrieve the results. These retrieved results are then ranked in order of similarity based on the classification result.

These types of existing search engines usually use a set of training data or mining data to predict the next keyword that might be useful to the user. However, the mining data used is purely from raw historical data which is logged based on the user's previous queries. The data has not been enhanced or enriched with value-added information. Consequently, the predicted queries which are returned by the search engine are irrelevant and inaccurate, hence returning search results which are also inaccurate. Therefore, there is a need to provide a system and method that can address the above-mentioned drawbacks of the existing search engines.

SUMMARY OF INVENTION

The present invention relates to a system (100) and method for predicting query in a search engine based on the similarity of an input query with historical data of keywords and classifications. The system (100) comprises at least one client server (101); a query server (102); and a database (103) to store prediction rules. The query server (102) further comprises a query manager (104) for processing queries submitted by the at least one client server (101) by extracting keywords and keywords classifications; a data mining engine (106) for providing query patterns; and a historical data generator (105) for updating historical data set based on the extracted keywords and keywords classifications.

The method for predicting query in a search engine based on the similarity of an input query with historical data of keywords and classifications is characterised by the steps of submitting the input query to a query server (102); extracting keywords from the query; performing query classifications on the extracted keywords; and performing a query executing process and performing an updating process based on the extracted keywords and classifications.

Preferably, the step of performing query classifications is based on the definition of the query by using named entity recognition and lexical database.

Preferably, the query executing process includes the steps of obtaining associated keywords by looking up query prediction rules which are stored in a database (103); executing query based on extracted keywords and associated keywords; and sending results to user.

Preferably, the updating process includes the steps of retrieving keywords and their classifications based on the query; separating each keyword and its classifications; combining keywords and classifications; storing all combinations as mining data; executing mining process based on the updated historical data to generate predictive relationships between the keywords; and storing prediction rules in a database (103).

Preferably, the step of combining keywords and classifications further includes the steps of selecting at least one keyword to be searched based on the query; selecting at least one classification for the at least one keyword; retrieving a list of other keywords based on the query; retrieving a list of related words for the other keywords based on the query; retrieving a list of related words of the at least one classification for the at least one keyword; determining similarity measurement between the list of related words for other keywords and the list of related words of the at least one classification for the at least one keyword using cosine similarity; and selecting combination of classification and other keywords.

Preferably, the list of related words for other keywords and the list of related words of the at least one classification for the at least one keyword are retrieved using a reverse dictionary.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.

FIG. 1 illustrates a block diagram of a system (100) for predicting query in a search engine according to an embodiment of the present invention.

FIG. 2 illustrates a flowchart of an overall process for predicting query in a search engine according to an embodiment of the present invention.

FIG. 3 illustrates a flowchart of a method to update historical data by a historical data generator (105) of the system (100) of FIG. 1.

FIG. 4 illustrates a flowchart of a method to combine the keywords and classifications by the historical data generator (105) of the system (100) of FIG. 1.

FIG. 5 illustrates an example of a process of selecting the highest similarity of related keywords and classifications based on the method of FIG. 4.

DESCRIPTION OF THE PREFERRED EMBODIMENT

A preferred embodiment of the present invention will be described herein below with reference to the accompanying drawings. In the following description, well known functions or constructions are not described in detail since they would obscure the description with unnecessary detail.

Reference is made initially to FIG. 1 which illustrates a system (100) for predicting query in a search engine according to an embodiment of the present invention. The system (100) comprises a client server (101), a query server (102) and a database (103). The query server (102) is connected to both the client server (101) and the database (103). The query server (102) comprises a query manager (104), a historical data generator (105) and a data mining engine (106).

The client server (101), which can be either a server or a client device such as a laptop or a mobile device, is used to submit queries to the query server (102), while the database (103) is used to store all prediction rules. The function of the query manager (104) is to process the queries by extracting keywords and performing query classifications based on their definition by using named entity recognition and lexical database. The historical data generator (105) updates historical data set based on the extracted keywords and keywords classifications, while the data mining engine (106) provides query patterns by looking at the user's past experience. This past experience includes the previous query keywords used by the user and the classifications of the keywords based on named entity recognition and lexical database. Since the system (100) provides predicted queries or keywords based on the user's past experience, which is trackable by the data mining engine (106), the system (100) is capable of improving search efficiency based on its predicted queries.

The method to predict queries is further explained in FIG. 2, which illustrates a flowchart of an overall process for predicting query in a search engine based on the similarity of an input query with historical data of keywords and classifications according to an embodiment of the present invention.

Initially, a user submits an input query from the client server (101) to the query server (102) as in step 201, wherein the query server (102) passes the query to the query manager (104). The query manager (104) then processes the query by extracting keywords and performing query classifications based on the definition of the query by using named entity recognition and lexical database as in step 202. The process then splits into two separate processes, which are a query executing process (referred to herein below as process A) and an updating process (referred to herein below as process B). These two processes are executed simultaneously.
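The split described above can be sketched as follows. This is a minimal illustration only: the `GAZETTEER` lookup table is a hypothetical stand-in for named entity recognition and a lexical database, the function names are assumptions, and the bodies of `process_a` and `process_b` are placeholders rather than the steps of the invention. A thread pool is just one possible way to run the two processes simultaneously.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical lookup table standing in for named entity recognition
# plus a lexical database (assumption, not part of the source).
GAZETTEER = {
    "Najib Tun Razak": ["Person"],
    "Prime Minister": ["Position"],
    "Malaysia": ["Country"],
}


def extract_and_classify(query):
    """Step 202: find known keywords in the query and attach classifications."""
    found = {}
    for keyword, classes in GAZETTEER.items():
        if keyword.lower() in query.lower():
            found[keyword] = list(classes)
    return found


def process_a(classified):
    """Placeholder for the query executing process (steps 203 to 205)."""
    return sorted(classified)


def process_b(classified):
    """Placeholder for the updating process (steps 300 to 500)."""
    return {kw: tuple(classes) for kw, classes in classified.items()}


def handle_query(query):
    """Classify once, then run process A and process B concurrently."""
    classified = extract_and_classify(query)
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(process_a, classified)
        future_b = pool.submit(process_b, classified)
        return future_a.result(), future_b.result()


results, history = handle_query("Najib Tun Razak Prime Minister Malaysia")
```

Both processes receive the same classified keywords from step 202, so neither has to wait on the other.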

For process A, the query manager (104) obtains a plurality of associated keywords by searching the query prediction rules previously stored in the database (103) as in step 203. For example, if the previous query entered by the user was "Najib Tun Razak Prime Minister Malaysia," the keyword classification result is based on named entity recognition and lexical database, wherein "Najib Tun Razak" is classified as a Person; "Prime Minister" is classified as a Position; and "Malaysia" is classified as a Country. Thereon, the data mining engine (106) generates query prediction rules which comprise {Najib Tun Razak} = {Najib Tun Razak, Person, Position, Country}; {Prime Minister} = {Prime Minister, Person, Position, Country}; and {Malaysia} = {Malaysia, Person, Position, Country}. The process continues by executing the query as in step 204 based on the keywords and associated keywords found in steps 202 and 203 respectively. Once the results of the query have been gathered, the results are sent back to the client server (101) as in step 205.

On the other hand, process B starts by updating historical data set as in step 300 based on the extracted keywords and keyword classifications by the historical data generator (105). Once the historical data is populated by combining the keywords and the classifications from past and recent queries, the mining process is then executed to generate predictive relationships or associations between keywords by the data mining engine (106) as in step 400. Referring to the previous example, if a user keys in a new query keyword of "Malaysia Position," the data mining engine (106) uses association rules to predict new keywords based on the prediction rules explained above. An example of the association rules is {Malaysia Position} = {Najib Tun Razak, Person, Country}. Although the user does not mention "Najib Tun Razak" in his recent query, "Najib Tun Razak" is associated to the new query based on the user's past experience or past query. Finally, these relationships or association rules are stored in the database (103) for future consumption as in step 500.
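The prediction rules in the example above can be reproduced with a short sketch. It only mirrors the rule shape given in the text (each keyword mapped to itself plus every classification seen in the same query); the function names and the dictionary-based store are assumptions, and the mining algorithm that derives association rules such as {Malaysia Position} is not specified in the source and is not attempted here.

```python
def build_prediction_rules(classified):
    """Map each keyword to itself plus every classification seen in the query."""
    all_classes = []
    for classes in classified.values():
        for name in classes:
            if name not in all_classes:
                all_classes.append(name)
    return {keyword: {keyword, *all_classes} for keyword in classified}


def associated_items(keyword, rules):
    """Step 203 lookup: items associated with a keyword (empty set if unknown)."""
    return rules.get(keyword, set()) - {keyword}


# Classification result from the example query in the text.
classified = {
    "Najib Tun Razak": ["Person"],
    "Prime Minister": ["Position"],
    "Malaysia": ["Country"],
}
rules = build_prediction_rules(classified)
```

Here `rules["Malaysia"]` holds the same four items as the rule {Malaysia} = {Malaysia, Person, Position, Country} in the description.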

Referring now to FIG. 3, it shows a flowchart of the method to update historical data by the historical data generator (105) as in step 300. Initially, the historical data generator (105) retrieves the keywords and classifications as in step 310, which were extracted previously from the user's past queries in step 202. Next, the historical data generator (105) separates the current keywords based on their classifications as in step 330 to get the keywords' current classifications. It is similar to the query prediction rules as explained in process A, wherein the keywords classification results are based on named entity recognition and lexical database. The historical data generator (105) then combines each keyword with its relevant classification as in step 350 to combine the user's current query keywords with the previous keywords and classifications from the user's past queries. The relevant classification here is based on the user's past experience. For example, if the user's current keyword is "Prime Minister," the relevant classification is "Najib Tun Razak." Although the user does not mention "Najib Tun Razak" in his current query, "Najib Tun Razak" is associated to the new query based on the user's past experience or past query. Once all the combinations are generated, the historical data generator (105) then stores the combinations as historical data in the database (103) as in step 370 for the data mining engine (106) to discover the potential predictive relationships between keywords as in step 400. The step of combining the keywords and classifications has an advantage over the existing search engines as it provides a resource for extracting predictive relationships between words in producing an enhanced keyword search.

A detailed process of step 350 in FIG. 3 is shown in FIG. 4, which illustrates a flowchart of the method to combine the keywords and classifications by the historical data generator (105). Initially, the historical data generator (105) selects a keyword Kx starting with the first keyword from a set of keywords {K1, ..., Kn} as in step 351. This set of keywords is generated after the user's current query keywords are separated and classified in step 330, wherein referring to the previous example, after "Najib Tun Razak" is classified as a Person; "Prime Minister" is classified as a Position; and "Malaysia" is classified as a Country, the set of keywords is {Najib Tun Razak, Prime Minister, Malaysia}.

If the historical data generator (105) has finished selecting all keywords from the set of keywords during this process as in decision 352, the historical data generator (105) returns a collection of HKx i.e. {{HK1}, ..., {HKn}} as in step 363, wherein HKx is a collection of Hm, and wherein Hm is a combination of keywords and their relevant classifications. On the other hand, if there are more keywords to be selected from the set of keywords, the historical data generator (105) retrieves a list of other keywords, Dm as in decision 352 and step 353, wherein Dm is a list of all keywords except the one chosen in step 351.

After retrieving the list of other keywords, Dm, the historical data generator (105) then retrieves a list of related words for each keyword in list Dm as in step 354 by using a reverse dictionary, wherein the list of related words is represented by WDm. Next, the historical data generator (105) selects a classification Cm of keyword Kx retrieved from step 202 starting with the first classification as in step 355. Referring to the previous example, the set of classifications associated to the keyword "Najib Tun Razak" is {Person}. However, a keyword can also have more than one classification, wherein for this example, besides a "Person," "Najib Tun Razak" can also be a "Malay Name" and a "Politician." If the historical data generator (105) has finished selecting all the classifications of Kx as in decision 356, it returns the set of Hm for keyword Kx i.e. HKx as in step 362, wherein Hm is a combination of keyword and relevant classification and HKx is a collection of Hm. On the other hand, if there are more classifications to be selected, the historical data generator (105) continues to retrieve a list of related words of Cm by using a reverse dictionary as in decision 356 and step 357, wherein the list is represented by WCm.

The historical data generator (105) then selects a set of related words WDm starting with the first set of related words based on the list of related words in WDm as in step 358. The process continues by calculating the similarity Vi between WCm and WDm by using the cosine similarity until all the sets of related words from WDm have been selected as in decision 359 and step 360. The steps of selecting the set of related words WDm and calculating the similarity of WCm and WDm are repeated until there are no more related words to be selected. Once these steps are completed, the historical data generator (105) returns the set of combinations Hm for each keyword, wherein for each classification Cm the combination {Kx, Cm, Dm} having the highest value of similarity is selected as in decision 359 and step 361. Once each of the keywords is processed from step 351 to step 361, the result of set Hm is stored as HKx as in step 362. Thereon, the set of HKx is transferred to step 370 to be stored in the database as historical data, which is used for the updating process of process B.
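The loop of FIG. 4 can be sketched as below, under stated assumptions. The `REVERSE_DICTIONARY` contents are invented for illustration (the source does not name a particular reverse dictionary or list any related words), related-word lists are compared as term-count vectors, and the function names are hypothetical. Step numbers in the comments refer to the flowchart as described above.

```python
import math
from collections import Counter

# Hypothetical reverse-dictionary output: term -> related words (toy data).
REVERSE_DICTIONARY = {
    "Person": ["human", "individual", "leader"],
    "Politician": ["leader", "government", "minister"],
    "Position": ["office", "rank", "leader"],
    "Country": ["nation", "state", "government"],
    "Najib Tun Razak": ["leader", "minister", "individual"],
    "Prime Minister": ["leader", "government", "minister", "office"],
    "Malaysia": ["nation", "state", "asia"],
}


def related_words(term):
    """Stand-in for a reverse dictionary lookup."""
    return REVERSE_DICTIONARY.get(term, [])


def cosine_similarity(words_a, words_b):
    """Cosine similarity between two word lists treated as count vectors."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def combine(classified):
    """Return {Kx: [Hm, ...]} where each Hm is a (Kx, Cm, other keyword) tuple."""
    collection = {}
    for kx, classes in classified.items():           # step 351 / decision 352
        others = [k for k in classified if k != kx]   # step 353: Dm
        hkx = []
        for cm in classes:                            # step 355 / decision 356
            wcm = related_words(cm)                   # step 357: WCm
            if not others:
                continue
            # Steps 358 to 361: score every other keyword, keep the best match.
            best = max(
                others,
                key=lambda d: cosine_similarity(wcm, related_words(d)),
            )
            hkx.append((kx, cm, best))
        collection[kx] = hkx                          # step 362: HKx
    return collection                                 # step 363


combinations = combine({
    "Najib Tun Razak": ["Person", "Politician"],
    "Prime Minister": ["Position"],
    "Malaysia": ["Country"],
})
```

With this toy data, each classification of "Najib Tun Razak" pairs with "Prime Minister" because their related-word lists overlap most; real results depend entirely on the reverse dictionary used.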

Referring now to FIG. 5, it illustrates an example of a process of selecting the highest similarity of related keywords and classifications as in step 351 to step 363 of FIG. 4 according to an embodiment of the present invention. Initially, the historical data generator (105) selects a keyword Kx starting with the first keyword from a set of keywords {K1, ..., Kn} as in step 351. The set of 3 keywords {K1, K2, K3} is represented as {A, B, C}. This set of keywords is generated after the user's current query keywords are separated and classified in step 330. For example, if the 3 keywords from a current query are "Najib Tun Razak," "Prime Minister," and "Malaysia," A represents "Najib Tun Razak," B represents "Prime Minister," and C represents "Malaysia."

After the historical data generator (105) selects the first keyword K1 i.e. A from the set of keywords {A, B, C} as in step 351, the historical data generator (105) retrieves a list of other keywords, Dm as in decision 352 and step 353, wherein Dm is {B, C}. Next, if there are more keywords to be selected from the set of keywords, the historical data generator (105) retrieves a list of related words for each keyword in list Dm as in step 354 by using a reverse dictionary, wherein the list of related words is represented by WDm, and wherein WDm is represented by the small dots in the circles related to B and C.

Next, the historical data generator (105) selects a classification, Cm of keyword A retrieved from step 202 starting with the first classification as in step 355. The example shows that there are 3 classifications {C1, C2, C3} of A which are represented as A1, A2 and A3 respectively, with the first classification C1 being A1. For example, if the 3 classifications of "Najib Tun Razak" are "Person," "Malay Name," and "Politician," A1 represents "Person," A2 represents "Malay Name," and A3 represents "Politician." If there are more classifications to be selected, the historical data generator (105) continues to retrieve a list of related words of each classification in Cm by using a reverse dictionary as in decision 356 and step 357, wherein the list is represented by WCm i.e. {WC1, WC2, WC3} and wherein WCm is represented by the small dots in the circles related to A1, A2 and A3. In other words, WC1 is a collection of keywords related to A1; WC2 is a collection of keywords related to A2; and WC3 is a collection of keywords related to A3.

As in step 358, the historical data generator (105) then selects a set of related words WDm starting with the first set of related words based on the list of related words in WDm i.e. the small dots in the circles related to B. The process continues by calculating the similarity Vi between WCm i.e. {WC1, WC2, WC3}, which are the collections of keywords related to A1, A2 and A3 respectively, represented by the small dots in the circles related to A1, A2 and A3, and WDm i.e. {WD1, WD2}, which are the collections of related words related to keyword B and keyword C respectively, represented by the small dots in the circles related to B and C. The similarity measurement is calculated by using the cosine similarity until all the sets of related words from WDm have been selected as in decision 359 and step 360. Numbers such as 0.8, 0.5, 0.2 and 0.6 represent the similarity Vi between WCm and WDm. For example, the similarity Vi between the first set of related words WDm i.e. the small dots in the circles related to B, and the first keyword's first classification i.e. the small dots in the circles related to A1, is shown as 0.8. The steps of selecting the set of related words WDm and calculating the similarity of WCm and WDm are repeated until there are no more related words to be selected. The steps of calculating the similarity between the classifications of the first keyword i.e. A, and the other keywords i.e. B and C, are shown in the first row, Row 1.
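For reference, the cosine similarity used in step 360 can be written out as follows. Treating each list of related words as a vector of term counts is an assumption made for illustration, since the source only states that cosine similarity is used. The symbols c_t and d_t (the counts of term t in WCm and WDm) and the word lists in the worked example are hypothetical.

```latex
V_i = \cos(WC_m, WD_m)
    = \frac{\sum_{t} c_t \, d_t}{\sqrt{\sum_{t} c_t^{2}} \; \sqrt{\sum_{t} d_t^{2}}}

% Worked example with hypothetical lists:
% WC = \{leader, government, minister\}, WD = \{leader, government, minister, office\}
% Three shared terms, each with count 1:
V = \frac{3}{\sqrt{3}\,\sqrt{4}} = \frac{\sqrt{3}}{2} \approx 0.866
```

A value of 1 means the two lists contain the same terms in the same proportions, and 0 means they share no terms.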

Once these steps are completed, the historical data generator (105) returns the set of combinations Hm for each keyword, wherein for each classification Cm the combination {Kx, Cm, Dm} having the highest value of similarity is selected as in decision 359 and step 361. The combination Hm of FIG. 5 is shown as output H1 = {A, A1, B}, H2 = {A, A2, C} and H3 = {A, A3, B}. These outputs can be translated as H1 = {"Najib Tun Razak", "Person", "Prime Minister"}; H2 = {"Najib Tun Razak", "Malay Name", "Malaysia"}; and H3 = {"Najib Tun Razak", "Politician", "Prime Minister"}. Similarly, the same process is done from step 351 to step 361 on other keywords i.e. B and C to get the combination {Kx, Cm, Dm}. Once each of the keywords is processed from step 351 to step 361, the result of set Hm is stored as HKx as in step 362. Thereon, the set of HKx is transferred to step 370 to be stored in the database as historical data, which is used for the updating process of process B.
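The selection in Row 1 of FIG. 5 amounts to picking, for each classification of A, the other keyword with the highest similarity. The sketch below illustrates that step with an assumed similarity table: only the value 0.8 for the pair (A1, B) is stated in the description, and the remaining entries are hypothetical values chosen so that the result matches the outputs H1, H2 and H3 given above. The function name is also an assumption.

```python
# Similarity Vi per classification of A against the other keywords B and C.
# Only A1/B = 0.8 comes from the description; the rest are assumed values.
similarity = {
    "A1": {"B": 0.8, "C": 0.5},
    "A2": {"B": 0.2, "C": 0.6},
    "A3": {"B": 0.7, "C": 0.3},
}


def select_combinations(keyword, table):
    """For each classification, keep the other keyword with the highest score."""
    return [
        (keyword, classification, max(row, key=row.get))
        for classification, row in table.items()
    ]


combos = select_combinations("A", similarity)
```

Each tuple corresponds to one Hm, so `combos` lists H1, H2 and H3 in order.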

While embodiments of the invention have been illustrated and described, it is not intended that these embodiments illustrate and describe all possible forms of the invention. Rather, the words used in the specifications are words of description rather than limitation and various changes may be made without departing from the scope of the invention.

Claims

1. A system (100) for predicting query in a search engine based on the similarity of an input query with historical data of keywords and classifications comprises:
a) at least one client server (101);
b) a query server (102); and
c) a database (103),
characterised in that the query server (102) further comprises:
i. a query manager (104) for processing queries submitted by the at least one client server (101) by extracting keywords and keywords classifications;
ii. a data mining engine (106) for providing query patterns; and
iii. a historical data generator (105) for updating historical data set based on the extracted keywords and keywords classifications.
2. A method for predicting query in a search engine based on the similarity of an input query with historical data of keywords and classifications is characterised by the steps of:
a) submitting the input query to a query server (102);
b) extracting keywords from the query;
c) performing query classifications on the extracted keywords; and
d) performing a query executing process and performing an updating process based on the extracted keywords and classifications.
3. The method as claimed in claim 2, wherein the step of performing query classifications is based on the definition of the query by using named entity recognition and lexical database.
4. The method as claimed in claim 2, wherein the query executing process includes the steps of:
a) obtaining associated keywords by looking up query prediction rules which are stored in a database (103);
b) executing query based on extracted keywords and associated keywords; and
c) sending results to user.
5. The method as claimed in claim 2, wherein the updating process includes the steps of:
a) retrieving keywords and their classifications based on the query;
b) separating each keyword and its classifications;
c) combining keywords and classifications;
d) storing all combinations as mining data;
e) executing mining process based on the updated historical data to generate predictive relationships between the keywords; and
f) storing prediction rules in a database (103).
6. The method as claimed in claim 5, wherein the step of combining keywords and classifications further includes the steps of:
a) selecting at least one keyword to be searched based on the query;
b) selecting at least one classification for the at least one keyword;
c) retrieving a list of other keywords based on the query;
d) retrieving a list of related words for the other keywords based on the query;
e) retrieving a list of related words of the at least one classification for the at least one keyword;
f) determining similarity measurement between the list of related words for other keywords and the list of related words of the at least one classification for the at least one keyword using cosine similarity; and
g) selecting combination of classification and other keywords.
7. The method as claimed in claim 6, wherein the list of related words for the other keywords based on the query is retrieved using a reverse dictionary.
8. The method as claimed in claim 6, wherein the list of related words of the at least one classification for the at least one keyword is retrieved using a reverse dictionary.
PCT/MY2014/000179 2013-11-20 2014-06-12 A system and method for predicting query in a search engine WO2015076662A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
MYPI2013702212 2013-11-20
MYPI2013702212 2013-11-20

Publications (1)

Publication Number Publication Date
WO2015076662A1 (en) 2015-05-28

Family

ID=51703369

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2014/000179 WO2015076662A1 (en) 2013-11-20 2014-06-12 A system and method for predicting query in a search engine

Country Status (1)

Country Link
WO (1) WO2015076662A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017095510A1 (en) * 2015-11-30 2017-06-08 Intel Corporation Multi-scale computer vision

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6006225A (en) * 1998-06-15 1999-12-21 Amazon.Com Refining search queries by the suggestion of correlated terms from prior searches

Similar Documents

Publication Publication Date Title
US8046363B2 (en) System and method for clustering documents
JP4569955B2 (en) Information storage and retrieval method
KR101669191B1 (en) Identifying query aspects
US7392238B1 (en) Method and apparatus for concept-based searching across a network
US9582608B2 (en) Unified ranking with entropy-weighted information for phrase-based semantic auto-completion
JP4906846B2 (en) Scoring user compatibility in social networks
US8145703B2 (en) User interface and method in a local search system with related search results
US8326842B2 (en) Semantic table of contents for search results
Batsakis et al. Improving the performance of focused web crawlers
RU2382400C2 (en) Construction and application of web-catalogues for focused search
JP5632124B2 (en) Rating method, search result sorting method, rating system, and search result sorting system
US20140025664A1 (en) Identifying terms associated with queries
US8812534B2 (en) Machine assisted query formulation
US8019748B1 (en) Web search refinement
US8965872B2 (en) Identifying query formulation suggestions for low-match queries
KR101700352B1 (en) Generating improved document classification data using historical search results
US7809721B2 (en) Ranking of objects using semantic and nonsemantic features in a system and method for conducting a search
TWI525458B (en) Recommended methods and devices for searching for keywords
US20040249808A1 (en) Query expansion using query logs
US8260809B2 (en) Voice-based search processing
US20090132953A1 (en) User interface and method in local search system with vertical search results and an interactive map
US9171078B2 (en) Automatic recommendation of vertical search engines
JP2005302042A (en) Term suggestion for multi-sense query
US20030167263A1 (en) Adaptive information-retrieval system
KR20090007626A (en) Method for domain identification of documents in a document database

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14784121

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14784121

Country of ref document: EP

Kind code of ref document: A1