WO2015076662A1

WO2015076662A1 - A system and method for predicting query in a search engine

Info

Publication number: WO2015076662A1
Application number: PCT/MY2014/000179
Authority: WO
Inventors: Bin Mat Nor FAZLI; Ysrin Bin Amruddin AMRU; Bin Mohammad Ali MOHAMMAD AZAM
Original assignee: Mimos Berhad
Priority date: 2013-11-20
Filing date: 2014-06-12
Publication date: 2015-05-28
Also published as: MY168793A

Abstract

The present invention relates to a system (100) and method for predicting query in a search engine based on the similarity of an input query with historical data of keywords and classifications. The system (100) comprises a client server (101), a query server (102) and a database (103). The query server (102) further comprises a query manager (104), a historical data generator (105) and a data mining engine (106). The system (100) provides predicted queries or keywords based on the user's past experience, which is trackable by the data mining engine (106). Thus, the system (100) is capable of improving search efficiency based on its predicted queries.

Description

A SYSTEM AND METHOD FOR PREDICTING QUERY IN A SEARCH ENGINE

FIELD OF INVENTION

The present invention relates to a system and method for predicting query in a search engine. More particularly, the present invention relates to a system and method for predicting query in a search engine based on a similarity of an input query with historical data of keywords and classifications.

BACKGROUND OF THE INVENTION

A query search is usually processed based on the keywords entered by users. However, it is sometimes also processed by using predicted keywords or queries which are based on similarity, semantics or popularity. These predicted keywords and queries give users a better understanding of the information. For a predicted query, users usually have to select only one query from the predicted list, and a search engine then processes the query before returning a set of results for the query.

An example of the above-mentioned search engine is disclosed in United States Patent Publication No. 2007/0239703 A1, which relates to a system and method for generating forecasts of keyword search by providing one or more seasonal categories. After determining a category for the keywords, the system generates a forecast of a keyword search volume for one or more keywords having a seasonal correlation value greater than or equal to a predetermined threshold. Another United States Patent Publication No. 2004/0049499 also discloses a search engine, wherein the system extracts keywords from a query and similarly classifies the keywords into a classification type for the system to retrieve the results. These retrieved results are then ranked in order of similarity based on the classification result.

These types of existing search engines usually use a set of training data or mining data to predict the next keyword that might be useful to the user. However, the mining data used is purely raw historical data which is logged based on the user's previous queries. The data has not been enhanced or enriched with value-added information. Consequently, the predicted queries which are returned by the search engine are irrelevant and inaccurate, hence returning search results which are also inaccurate. Therefore, there is a need to provide a system and method that can address the above-mentioned drawbacks of the existing search engines.

SUMMARY OF INVENTION

The present invention relates to a system (100) and method for predicting query in a search engine based on the similarity of an input query with historical data of keywords and classifications. The system (100) comprises at least one client server (101); a query server (102); and a database (103) to store prediction rules. The query server (102) further comprises a query manager (104) for processing queries submitted by the at least one client server (101) by extracting keywords and keyword classifications; a data mining engine (106) for providing query patterns; and a historical data generator (105) for updating the historical data set based on the extracted keywords and keyword classifications.

The method for predicting query in a search engine based on the similarity of an input query with historical data of keywords and classifications is characterised by the steps of submitting the input query to a query server (102); extracting keywords from the query; performing query classifications on the extracted keywords; and performing a query executing process and performing an updating process based on the extracted keywords and classifications.

Preferably, the step of performing query classifications is based on the definition of the query by using named entity recognition and lexical database.

Preferably, the query executing process includes the steps of obtaining associated keywords by looking up query prediction rules which are stored in a database (103); executing query based on extracted keywords and associated keywords; and sending results to user.

Preferably, the updating process includes the steps of retrieving keywords and their classifications based on the query; separating each keyword and its classifications; combining keywords and classifications; storing all combinations as mining data; executing mining process based on the updated historical data to generate predictive relationships between the keywords; and storing prediction rules in a database (103).

Preferably, the step of combining keywords and classifications further includes the steps of selecting at least one keyword to be searched based on the query; selecting at least one classification for the at least one keyword; retrieving a list of other keywords based on the query; retrieving a list of related words for the other keywords based on the query; retrieving a list of related words of the at least one classification for the at least one keyword; determining similarity measurement between the list of related words for other keywords and the list of related words of the at least one classification for the at least one keyword using cosine similarity; and selecting combination of classification and other keywords.

Preferably, the list of related words for other keywords and the list of related words of the at least one classification for the at least one keyword are retrieved using a reverse dictionary.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.

FIG. 1 illustrates a block diagram of a system (100) for predicting query in a search engine according to an embodiment of the present invention.

FIG. 2 illustrates a flowchart of an overall process for predicting query in a search engine according to an embodiment of the present invention.

FIG. 3 illustrates a flowchart of a method to update historical data by a historical data generator (105) of the system (100) of FIG. 1.

FIG. 4 illustrates a flowchart of a method to combine the keywords and classifications by the historical data generator (105) of the system (100) of FIG. 1.

FIG. 5 illustrates an example of a process of selecting the highest similarity of related keywords and classifications based on the method of FIG. 4.

DESCRIPTION OF THE PREFERRED EMBODIMENT

A preferred embodiment of the present invention will be described herein below with reference to the accompanying drawings. In the following description, well known functions or constructions are not described in detail since they would obscure the description with unnecessary detail.

Reference is made initially to FIG. 1 which illustrates a system (100) for predicting query in a search engine according to an embodiment of the present invention. The system (100) comprises a client server (101), a query server (102) and a database (103). The client server (101) is connected to the query server (102), while the query server (102) is connected to the client server (101) and the database (103). The query server (102) comprises a query manager (104), a historical data generator (105) and a data mining engine (106).

The client server (101), which can be either a server or a client device such as a laptop or a mobile device, is used to submit queries to the query server (102), while the database (103) is used to store all prediction rules. The function of the query manager (104) is to process the queries by extracting keywords and performing query classifications based on their definition by using named entity recognition and a lexical database. The historical data generator (105) updates the historical data set based on the extracted keywords and keyword classifications, while the data mining engine (106) provides query patterns by looking at the user's past experience. This past experience includes the previous query keywords used by the user and the classifications of those keywords based on named entity recognition and the lexical database. Since the system (100) provides predicted queries or keywords based on the user's past experience, which is trackable by the data mining engine (106), the system (100) is capable of improving search efficiency based on its predicted queries.

The method to predict queries is further explained in FIG. 2, which illustrates a flowchart of an overall process for predicting query in a search engine based on the similarity of an input query with historical data of keywords and classifications according to an embodiment of the present invention.

Initially, a user submits an input query from the client server (101) to the query server (102) as in step 201, wherein the query server (102) passes the query to the query manager (104). The query manager (104) then processes the query by extracting keywords and performing query classifications based on the definition of the query by using named entity recognition and a lexical database as in step 202. The process then splits into two separate processes, namely a query executing process (referred to hereinbelow as process A) and an updating process (referred to hereinbelow as process B). These two processes are executed simultaneously.
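The flow of steps 201 and 202 and the split into processes A and B can be sketched as follows. This is only a minimal illustration under stated assumptions: the `GAZETTEER` table is a toy stand-in for named entity recognition and the lexical database, and the function names `extract_and_classify`, `execute_query`, `update_history` and `handle_query` are hypothetical and do not come from the specification. A thread pool is one possible way to run the two processes at the same time.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy lookup table standing in for named entity recognition plus a
# lexical database (assumption, not part of the specification).
GAZETTEER = {
    "najib tun razak": ["Person"],
    "prime minister": ["Position"],
    "malaysia": ["Country"],
}


def extract_and_classify(query):
    """Step 202 (sketch): greedy longest-match keyword extraction."""
    tokens = query.lower().split()
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest phrase starting at position i first.
        for j in range(len(tokens), i, -1):
            phrase = " ".join(tokens[i:j])
            if phrase in GAZETTEER:
                found.append((phrase, GAZETTEER[phrase]))
                i = j
                break
        else:
            i += 1  # no known phrase starts here; skip the token
    return found


def execute_query(keywords):
    """Process A placeholder: pretend to run the search."""
    return {"searched": [kw for kw, _ in keywords]}


def update_history(keywords, history):
    """Process B placeholder: append the classified keywords to the history."""
    history.append(keywords)
    return len(history)


def handle_query(query, history):
    """Steps 201-202, then processes A and B submitted concurrently."""
    keywords = extract_and_classify(query)
    with ThreadPoolExecutor(max_workers=2) as pool:
        process_a = pool.submit(execute_query, keywords)
        process_b = pool.submit(update_history, keywords, history)
        return process_a.result(), process_b.result()
```

Calling `handle_query("Najib Tun Razak Prime Minister Malaysia", [])` extracts the three keywords of the running example before both placeholder processes are run.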

For process A, the query manager (104) obtains a plurality of associated keywords by searching the previously stored query prediction rules in the database (103) as in step 203. For example, if the previous query entered by the user is "Najib Tun Razak Prime Minister Malaysia," the keyword classification result is based on named entity recognition and a lexical database, wherein "Najib Tun Razak" is classified as a Person; "Prime Minister" is classified as a Position; and "Malaysia" is classified as a Country. Thereon, the data mining engine (106) generates query prediction rules which comprise {Najib Tun Razak} = {Najib Tun Razak, Person, Position, Country}; {Prime Minister} = {Prime Minister, Person, Position, Country}; and {Malaysia} = {Malaysia, Person, Position, Country}. The process continues by executing the query as in step 204 based on the keywords and associated keywords found in steps 202 and 203 respectively. Once the results of the query have been gathered, the results are sent back to the client server (101) as in step 205.

On the other hand, process B starts by updating the historical data set as in step 300 based on the extracted keywords and keyword classifications by the historical data generator (105). Once the historical data is populated by combining the keywords and the classifications from past and recent queries, the mining process is then executed by the data mining engine (106) to generate predictive relationships or associations between keywords as in step 400. Referring to the previous example, if a user keys in a new query keyword of "Malaysia Position," the data mining engine (106) uses association rules to predict new keywords based on the prediction rules explained above. An example of the association rules is {Malaysia Position} = {Najib Tun Razak, Person, Country}. Although the user does not mention "Najib Tun Razak" in his recent query, "Najib Tun Razak" is associated with the new query based on the user's past experience or past query. Finally, these relationships or association rules are stored in the database (103) for future consumption as in step 500.
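The shape of the query prediction rules in the example above can be illustrated with a short sketch. It assumes, as the example suggests, that each keyword of a query maps to itself followed by every classification found in that query. The function names `build_prediction_rules` and `associated_keywords` are hypothetical. The association-rule mining of step 400 is not specified in enough detail to reproduce, so it is not attempted here.

```python
def build_prediction_rules(classified):
    """Build one rule per keyword from (keyword, classification) pairs.

    Each rule maps a keyword to itself plus all classifications seen in
    the same query, mirroring the worked example in the description.
    """
    classes = [classification for _, classification in classified]
    return {keyword: [keyword] + classes for keyword, _ in classified}


def associated_keywords(rules, keyword):
    """Step 203 (sketch): look up the items associated with a keyword."""
    return [item for item in rules.get(keyword, []) if item != keyword]


rules = build_prediction_rules([
    ("Najib Tun Razak", "Person"),
    ("Prime Minister", "Position"),
    ("Malaysia", "Country"),
])
```

Here `rules["Malaysia"]` holds `["Malaysia", "Person", "Position", "Country"]`, which corresponds to the third rule of the example; an unknown keyword simply yields an empty list of associations.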

Referring now to FIG. 3, it shows a flowchart of the method to update historical data by the historical data generator (105) as in step 300. Initially, the historical data generator (105) retrieves the keywords and classifications as in step 310, which were extracted previously from the user's past queries in step 202. Next, the historical data generator (105) separates the current keywords based on their classifications as in step 330 to get the keywords' current classifications. This is similar to the query prediction rules as explained in process A, wherein the keyword classification results are based on named entity recognition and a lexical database. The historical data generator (105) then combines each keyword with its relevant classification as in step 350 to combine the user's current query keywords with the previous keywords and classifications from the user's past queries. The relevant classification here is based on the user's past experience. For example, if the user's current keyword is "Prime Minister," the relevant classification is "Najib Tun Razak." Although the user does not mention "Najib Tun Razak" in his current query, "Najib Tun Razak" is associated with the new query based on the user's past experience or past query. Once all the combinations are generated, the historical data generator (105) then stores the combinations as historical data in the database (103) as in step 370 for the data mining engine (106) to discover the potential predictive relationships between keywords as in step 400. The step of combining the keywords and classifications has an advantage over the existing search engines as it provides a resource for extracting predictive relationships between words in producing an enhanced keyword search.

A detailed process of step 350 in FIG. 3 is shown in FIG. 4, which illustrates a flowchart of the method to combine the keywords and classifications by the historical data generator (105). Initially, the historical data generator (105) selects a keyword Kx starting with the first keyword from a set of keywords {K1, ..., Kn} as in step 351. This set of keywords is generated after the user's current query keywords are separated and classified in step 330. Referring to the previous example, after "Najib Tun Razak" is classified as a Person, "Prime Minister" is classified as a Position, and "Malaysia" is classified as a Country, the set of keywords is {Najib Tun Razak, Prime Minister, Malaysia}.

If the historical data generator (105) has finished selecting all keywords from the set of keywords during this process as in decision 352, the historical data generator (105) returns a collection of HKx i.e. {{HK1}, ..., {HKn}} as in step 363, wherein HKx is a collection of Hm, and wherein Hm is a combination of keywords and their relevant classifications. On the other hand, if there are more keywords to be selected from the set of keywords, the historical data generator (105) retrieves a list of other keywords, Dm, as in decision 352 and step 353, wherein Dm is a list of all keywords except the one chosen in step 351.

After retrieving the list of other keywords, Dm, the historical data generator (105) then retrieves a list of related words for each keyword in list Dm as in step 354 by using a reverse dictionary, wherein the list of related words is represented by WDm. Next, the historical data generator (105) selects a classification Cm of keyword Kx retrieved from step 202, starting with the first classification, as in step 355. Referring to the previous example, the set of classifications associated with the keyword "Najib Tun Razak" is {Person}. However, a keyword can also have more than one classification, wherein for this example, besides a "Person," "Najib Tun Razak" can also be a "Malay Name" and a "Politician." If the historical data generator (105) has finished selecting all the classifications of Kx as in decision 356, it returns the set of Hm for keyword Kx i.e. HKx as in step 362, wherein Hm is a combination of keyword and relevant classification and HKx is a collection of Hm. On the other hand, if there are more classifications to be selected, the historical data generator (105) continues to retrieve a list of related words of Cm by using a reverse dictionary as in decision 356 and step 357, wherein the list is represented by WCm.

The historical data generator (105) then selects a set of related words WDm starting with the first set of related words based on the list of related words in WDm as in step 358. The process continues by calculating the similarity Vi between WCm and WDm by using the cosine similarity until all the sets of related words from WDm have been selected as in decision 359 and step 360. The steps of selecting the set of related words WDm and calculating the similarity of WCm and WDm are repeated until there are no more related words to be selected. Once these steps are completed, the historical data generator (105) returns the set of combinations Hm for each keyword, wherein the combination of Cm with other keywords is selected as the combination {Kx, Cm, Dm} having the highest value of similarity as in decision 359 and step 361. Once each of the keywords is processed from step 351 to step 361, the result of set Hm is stored as HKx as in step 362. Thereon, the set of HKx is transferred to step 370 to be stored in the database as historical data, which is used for the updating process of process B.
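Steps 351 to 363 can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the word lists are compared as bag-of-words count vectors (the description names cosine similarity but not the vector representation), the reverse dictionary is replaced by a small hand-written table with invented entries, and the names `cosine`, `combine` and `TOY_REVERSE_DICTIONARY` are hypothetical.

```python
import math
from collections import Counter


def cosine(words_a, words_b):
    """Cosine similarity of two word lists treated as count vectors."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def combine(classified, related_words):
    """Return {Kx: [(Kx, Cm, Dm), ...]} as the collection of HKx.

    classified:    dict mapping each keyword to its classifications
    related_words: callable returning related words for a term
                   (stand-in for the reverse dictionary)
    """
    history = {}
    for kx, classes in classified.items():              # steps 351-352
        others = [k for k in classified if k != kx]     # step 353: Dm
        wd = {d: related_words(d) for d in others}      # step 354: WDm
        hkx = []
        for cm in classes:                              # steps 355-356
            wc = related_words(cm)                      # step 357: WCm
            scores = {d: cosine(wc, wd[d]) for d in others}  # steps 358-360
            if scores:
                best = max(scores, key=scores.get)      # step 361
                hkx.append((kx, cm, best))
        history[kx] = hkx                               # step 362
    return history                                      # step 363


# Invented related-word lists, for illustration only.
TOY_REVERSE_DICTIONARY = {
    "Person": ["human", "individual", "leader"],
    "Politician": ["leader", "government", "minister"],
    "Position": ["rank", "office", "post"],
    "Country": ["nation", "state", "territory"],
    "Najib Tun Razak": ["leader", "politician", "individual", "office"],
    "Prime Minister": ["leader", "government", "minister", "office", "state"],
    "Malaysia": ["nation", "state", "asia"],
}

classified = {
    "Najib Tun Razak": ["Person", "Politician"],
    "Prime Minister": ["Position"],
    "Malaysia": ["Country"],
}
result = combine(classified, lambda term: TOY_REVERSE_DICTIONARY.get(term, []))
```

With these invented word lists, both classifications of "Najib Tun Razak" pair with "Prime Minister", because that keyword's related words overlap most with the related words of "Person" and "Politician". A real reverse dictionary would of course give different lists and possibly different pairings.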

Referring now to FIG. 5, it illustrates an example of a process of selecting the highest similarity of related keywords and classifications as in step 351 to step 363 of FIG. 4 according to an embodiment of the present invention. Initially, the historical data generator (105) selects a keyword Kx starting with the first keyword from a set of keywords {K1, ..., Kn} as in step 351. The set of 3 keywords {K1, K2, K3} is represented as {A, B, C}. This set of keywords is generated after the user's current query keywords are separated and classified in step 330. For example, if the 3 keywords from a current query are "Najib Tun Razak," "Prime Minister," and "Malaysia," A represents "Najib Tun Razak," B represents "Prime Minister," and C represents "Malaysia."

After the historical data generator (105) selects the first keyword K1 i.e. A from the set of keywords {A, B, C} as in step 351, the historical data generator (105) retrieves a list of other keywords, Dm, as in decision 352 and step 353, wherein Dm is {B, C}. Next, if there are more keywords to be selected from the set of keywords, the historical data generator (105) retrieves a list of related words for each keyword in list Dm as in step 354 by using a reverse dictionary, wherein the list of related words is represented by WDm, and wherein WDm is represented by the small dots in the circles related to B and C.

Next, the historical data generator (105) selects a classification, Cm, of keyword A retrieved from step 202, starting with the first classification, as in step 355. The example shows that there are 3 classifications {C1, C2, C3} of A which are represented as A1, A2 and A3 respectively, with the first classification C1 being A1. For example, if the 3 classifications of "Najib Tun Razak" are "Person," "Malay Name," and "Politician," A1 represents "Person," A2 represents "Malay Name," and A3 represents "Politician." If there are more classifications to be selected, the historical data generator (105) continues to retrieve a list of related words of each classification Cm by using a reverse dictionary as in decision 356 and step 357, wherein the list is represented by WCm i.e. {WC1, WC2, WC3} and wherein WCm is represented by the small dots in the circles related to A1, A2 and A3. In other words, WC1 is a collection of keywords related to A1; WC2 is a collection of keywords related to A2; and WC3 is a collection of keywords related to A3.

As in step 358, the historical data generator (105) then selects a set of related words WDm starting with the first set of related words based on the list of related words in WDm i.e. the small dots in the circles related to B. The process continues by calculating the similarity Vi between WCm i.e. {WC1, WC2, WC3}, which are the collections of keywords related to A1, A2 and A3 respectively, represented by the small dots in the circles related to A1, A2 and A3, and WDm i.e. {WD1, WD2}, which are the collections of related words related to keyword B and keyword C respectively, represented by the small dots in the circles related to B and C. The similarity measurement is calculated by using the cosine similarity until all the sets of related words from WDm have been selected as in decision 359 and step 360. Numbers such as 0.8, 0.5, 0.2, 0.6, etc. represent the similarity Vi between WCm and WDm. For example, the similarity Vi between the first set of related words WDm i.e. the small dots in the circles related to B, and the first keyword's first classification i.e. the small dots in the circles related to A1, is shown as 0.8. The steps of selecting the set of related words WDm and calculating the similarity of WCm and WDm are repeated until there are no more related words to be selected. The steps of calculating the similarity between the classifications of the first keyword i.e. A, with other keywords i.e. B and C, are shown in the first row, Row 1.
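For reference, the cosine similarity named above can be written out as follows. The description does not state how the word lists WCm and WDm are turned into vectors, so this assumes bag-of-words vectors c and d over a shared vocabulary, with c_t and d_t the counts of word t in WCm and WDm respectively. The second line is the special case in which every word appears at most once in each list.

```latex
\begin{align*}
V_i = \cos(\mathbf{c}, \mathbf{d})
    &= \frac{\mathbf{c} \cdot \mathbf{d}}
            {\lVert \mathbf{c} \rVert \, \lVert \mathbf{d} \rVert}
     = \frac{\sum_{t} c_t \, d_t}
            {\sqrt{\sum_{t} c_t^{2}} \; \sqrt{\sum_{t} d_t^{2}}} \\[4pt]
V_i &= \frac{\lvert WC_m \cap WD_m \rvert}
            {\sqrt{\lvert WC_m \rvert \, \lvert WD_m \rvert}}
     \qquad \text{(binary vectors)}
\end{align*}
```

As a purely hypothetical illustration of the second form, two lists of five distinct related words that share four words would score 4 / sqrt(5 x 5) = 0.8. The word counts here are assumed and are not taken from FIG. 5.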

Once these steps are completed, the historical data generator (105) returns the set of combinations Hm for each keyword, wherein the combination of Cm with other keywords is selected as the combination {Kx, Cm, Dm} having the highest value of similarity as in decision 359 and step 361. The combination Hm of FIG. 5 is shown as output H1 = {A, A1, B}, H2 = {A, A2, C} and H3 = {A, A3, B}. These outputs can be translated as H1 = {"Najib Tun Razak", "Person", "Prime Minister"}; H2 = {"Najib Tun Razak", "Malay Name", "Malaysia"}; and H3 = {"Najib Tun Razak", "Politician", "Prime Minister"}. Similarly, the same process is done from step 351 to step 361 on the other keywords i.e. B and C to get the combination {Kx, Cm, Dm}. Once each of the keywords is processed from step 351 to step 361, the result of set Hm is stored as HKx as in step 362. Thereon, the set of HKx is transferred to step 370 to be stored in the database as historical data, which is used for the updating process of process B.
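The selection of step 361 for Row 1 can be illustrated with a small table of similarity values. Only the value 0.8 for the pair (A1, B) is stated in the description; the other five values are assumed here purely so that the sketch reproduces the stated outputs H1, H2 and H3. The function name `select_combinations` is hypothetical.

```python
# Similarity Vi between each classification of A (rows) and each other
# keyword (columns). Only 0.8 for (A1, B) comes from the description;
# all other values are assumed for illustration.
similarity = {
    "A1": {"B": 0.8, "C": 0.5},
    "A2": {"B": 0.2, "C": 0.6},
    "A3": {"B": 0.7, "C": 0.3},
}


def select_combinations(keyword, table):
    """Step 361 (sketch): for each classification, keep the other keyword
    with the highest similarity, giving one {Kx, Cm, Dm} per classification."""
    return [(keyword, cm, max(row, key=row.get)) for cm, row in table.items()]


H = select_combinations("A", similarity)
```

With the assumed values, `H` contains `("A", "A1", "B")`, `("A", "A2", "C")` and `("A", "A3", "B")`, matching H1, H2 and H3 above.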

While embodiments of the invention have been illustrated and described, it is not intended that these embodiments illustrate and describe all possible forms of the invention. Rather, the words used in the specifications are words of description rather than limitation and various changes may be made without departing from the scope of the invention.

Claims

1. A system (100) for predicting query in a search engine based on the similarity of an input query with historical data of keywords and classifications comprises:

a) at least one client server (101);

b) a query server (102); and

c) a database (103),

characterised in that the query server (102) further comprises:

i. a query manager (104) for processing queries submitted by the at least one client server (101) by extracting keywords and keywords classifications;

ii. a data mining engine (106) for providing query patterns; and

iii. a historical data generator (105) for updating historical data set based on the extracted keywords and keyword classifications.

2. A method for predicting query in a search engine based on the similarity of an input query with historical data of keywords and classifications is characterised by the steps of:

a) submitting the input query to a query server (102);

b) extracting keywords from the query;

c) performing query classifications on the extracted keywords; and

d) performing a query executing process and performing an updating process based on the extracted keywords and classifications.

3. The method as claimed in claim 2, wherein the step of performing query classifications is based on the definition of the query by using named entity recognition and lexical database.

4. The method as claimed in claim 2, wherein the query executing process includes the steps of:

a) obtaining associated keywords by looking up query prediction rules which are stored in a database (103);

b) executing query based on extracted keywords and associated keywords; and

c) sending results to user.

5. The method as claimed in claim 2, wherein the updating process includes the steps of:

a) retrieving keywords and their classifications based on the query;

b) separating each keyword and its classifications;

c) combining keywords and classifications;

d) storing all combinations as mining data;

e) executing mining process based on the updated historical data to generate predictive relationships between the keywords; and

f) storing prediction rules in a database (103).

6. The method as claimed in claim 5, wherein the step of combining keywords and classifications further includes the steps of:

a) selecting at least one keyword to be searched based on the query;

b) selecting at least one classification for the at least one keyword;

c) retrieving a list of other keywords based on the query;

d) retrieving a list of related words for the other keywords based on the query;

e) retrieving a list of related words of the at least one classification for the at least one keyword;

f) determining similarity measurement between the list of related words for other keywords and the list of related words of the at least one classification for the at least one keyword using cosine similarity; and

g) selecting combination of classification and other keywords.

7. The method as claimed in claim 6, wherein the list of related words for the other keywords based on the query is retrieved using a reverse dictionary.

8. The method as claimed in claim 6, wherein the list of related words of the at least one classification for the at least one keyword is retrieved using a reverse dictionary.