WO2007101194A2 - System and method for identifying related queries for languages with multiple writing systems - Google Patents

System and method for identifying related queries for languages with multiple writing systems

Info

Publication number
WO2007101194A2
Authority
WO
WIPO (PCT)
Prior art keywords
query
queries
candidate set
characters
received
Prior art date
Application number
PCT/US2007/062876
Other languages
French (fr)
Other versions
WO2007101194A3 (en)
Inventor
Rosie Jones
Kevin Bartz
Benjamin Rey
Original Assignee
Yahoo! Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yahoo! Inc.
Priority to JP2008557464A (publication JP2009528636A)
Priority to CN200780006965XA (publication CN101390097B)
Priority to EP07757547A (publication EP1929415A4)
Publication of WO2007101194A2
Publication of WO2007101194A3
Priority to HK09108573.9A (publication HK1130912A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching
    • Y10S707/99934 Query formulation, input preparation, or translation

Definitions

  • the present invention generally provides methods and systems for identifying one or more queries related to a given search query written according to a language with multiple writing systems. More specifically, the present invention provides methods and systems for receiving a search query written according to combinations of one or more writing systems of a language with multiple writing systems and identifying one or more related queries from a candidate set of queries.
  • a user inputs a query via a client device and a search process returns one or more items of content, such as links, documents, web pages, advertisements, etc., related to the query.
  • the items of content returned in response to a given query may be closely related or entirely unrelated to the subject or topic that the user was actually seeking.
  • the success of a given search which may be measured based upon how closely related the items of content retrieved are to a given query, may depend significantly upon the proper interpretation of a search query.
  • a query is made up of one or more words and phrases. Queries entered by human users, however, often fail to adequately describe the content that a given user may be seeking. Moreover, users may have only a general or vague idea of the content they may be seeking. For example, a user may wish to conduct a search, using the Yahoo! search engine, for a product advertised on television. The user may not know the name of the product, the manufacturer, etc., and may only be able to generally describe the product. Therefore, the query formulated by the user may be too broad, resulting in the retrieval of content items entirely unrelated to the content sought by the user. Similarly, the query terms selected by the user may fail to adequately describe the product, resulting in the retrieval of few if any content items.
  • a single Japanese query submitted to a search engine may be written according to varying combinations of one or more Japanese writing systems, such as Kanji, Katakana, Hiragana, JASCII, ASCII, etc.
  • a query written according to the Japanese Kanji writing system may look entirely different from a query written according to the Japanese Katakana and Hiragana writing systems; however, the two queries may have very similar or identical meanings.
  • search providers such as Yahoo!, MSN, or Google may utilize a bidding marketplace whereby advertisers may bid upon terms in order to have one or more advertisements displayed in response to a query.
  • advertisers may wish to display one or more advertisements for laptop computers and accordingly may bid upon the terms "notebook computer."
  • the terms "notebook computer," however, may be written according to one or more writing systems of a language with multiple writing systems, such as Japanese.
  • the terms "notebook computer" may be written according to the Japanese Hiragana writing system, the Japanese Katakana writing system, etc.
  • a user may submit a query comprising the terms "notebook computer" to a given search provider, such as Yahoo!, written according to the Japanese Katakana writing system.
  • the one or more advertisements with associated bids for the Katakana terms "notebook computer" may be retrieved and displayed to the user.
  • the advertisement associated with the advertiser that provided the greatest bid for the Katakana terms "notebook computer" may be displayed in the most prominent position of a web page, e.g., ranked first in a ranked list of advertisements, displayed at the top of a given search results page, etc.
  • the search provider may monetize the selection of the user, such as by charging the advertiser associated with the advertisement selected an amount of money based upon the advertiser's bid. Retrieving and displaying only the advertisements that have associated bids for one or more terms, however, may result in a significant loss of revenue to a given search provider. For example, if a user enters a query comprised of terms that have not been bid upon by one or more advertisers, the search provider may fail to return any advertisements to the user, resulting in a loss of revenue to the search provider as the user will be unable to select any results.
  • the present invention provides systems and methods for identifying one or more queries from a candidate set of related queries that are the most similar in meaning with respect to a given search query, written according to one or more writing systems of a language with multiple writing systems.
  • the present invention is directed towards methods and systems for identifying one or more queries related to a given query.
  • the method of the present invention comprises receiving a query written according to one or more writing systems of a language with multiple writing systems.
  • the query received comprises a query written according to a combination of one or more Japanese writing systems, including the Japanese Hiragana, Katakana, Kana, Romaji, JASCII, and Kanji writing systems.
  • a candidate set of queries written according to one or more writing systems of the language with multiple writing systems associated with the query received is identified.
  • the candidate set of queries comprises one or more queries related to the query received as indicated in one or more query logs.
  • the method further comprises calculating a score for the one or more queries in the candidate set indicating the similarity of the one or more queries with respect to the query received.
  • the score calculated for the one or more queries in the candidate set indicates the similarity in meaning of a given query from the candidate set with respect to the received query.
  • calculating a score comprises calculating a character edit distance between the received query and a query selected from the candidate set after converting the one or more characters in each query to Roman characters.
  • calculating a score comprises calculating a character edit distance between the received query and a query selected from the candidate set after converting the one or more characters in each query to Roman characters and removing space characters from each query.
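
The two character edit distance variants above can be sketched as follows. This is a minimal illustration that assumes the standard Levenshtein definition of edit distance (the text does not name a specific algorithm) and assumes both queries have already been converted to Roman characters by some separate step; the function names and the romanized strings are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_distance_no_spaces(a: str, b: str) -> int:
    """Same distance, computed after removing space characters."""
    return edit_distance(a.replace(" ", ""), b.replace(" ", ""))


# Hypothetical romanized forms of a received query and a candidate query.
q = "nooto pasokon"
q_prime = "nootopasokon"
print(edit_distance(q, q_prime))            # 1 (the single space)
print(edit_distance_no_spaces(q, q_prime))  # 0
```

Removing spaces first makes the score insensitive to differences in word segmentation, which is one plausible reason for keeping both variants.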
  • calculating a score comprises converting the characters of the query received and a query selected from the candidate set to Roman characters, and calculating the difference between one ("1") and the quotient of the number of unique space-separated co-occurring words in the received query and the selected query and the total number of unique space-separated words in both queries.
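
The word-level score just described is one minus the ratio of shared unique words to all unique words across both queries. A minimal sketch, assuming the queries are already romanized and that two empty queries score 0.0 (a choice made here, not stated in the text); `word_distance` is a hypothetical name.

```python
def word_distance(q: str, q_prime: str) -> float:
    """1 - (unique co-occurring words) / (unique words in both queries)."""
    words_q = set(q.split())
    words_qp = set(q_prime.split())
    all_words = words_q | words_qp
    if not all_words:
        return 0.0  # assumption: two empty queries are treated as identical
    return 1.0 - len(words_q & words_qp) / len(all_words)


# Hypothetical romanized queries: one shared word out of three unique words.
print(word_distance("nooto pasokon", "nooto pc"))
```

A value of 0.0 means the two queries use exactly the same set of words, and 1.0 means they share none.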
  • calculating a score comprises identifying whether a digit is unique to the received query and a query selected from the candidate set.
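
One way to read the digit check is as a test of whether either query contains a digit the other lacks. The sketch below assumes that full-width digits should be treated the same as their ASCII counterparts, so it applies Unicode NFKC normalization first; that normalization step and the function names are assumptions, not part of the described method.

```python
import unicodedata


def digit_set(text: str) -> set:
    """Digits appearing in text, after folding full-width forms to ASCII."""
    normalized = unicodedata.normalize("NFKC", text)
    return {ch for ch in normalized if ch.isdigit()}


def has_unique_digit(q: str, q_prime: str) -> bool:
    """True if some digit appears in one query but not in the other."""
    return digit_set(q) != digit_set(q_prime)


print(has_unique_digit("windows 95", "windows 98"))  # True
print(has_unique_digit("ps2 soft", "ps２ sofuto"))    # False
```

Digits often carry meaning that edit distance underweights (model numbers, versions, years), so a single differing digit can be a strong signal that two otherwise similar queries are not equivalent.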
  • calculating a score comprises calculating a difference between the value one ("1") and the quotient of the number of co-occurring Japanese Kanji characters in the received query and a selected query from the candidate set, and the total number of unique Japanese Kanji characters in the received query and the selected query from the candidate set.
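
The Kanji-based score has the same shape as the word-based one, applied to sets of Kanji characters. In this sketch Kanji are detected by membership in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF); ignoring the extension blocks and returning 0.0 when neither query contains Kanji are simplifying assumptions, and the function names are hypothetical.

```python
def kanji_chars(text: str) -> set:
    """Unique characters in the basic CJK Unified Ideographs block."""
    return {ch for ch in text if "\u4e00" <= ch <= "\u9fff"}


def kanji_distance(q: str, q_prime: str) -> float:
    """1 - (co-occurring Kanji) / (unique Kanji across both queries)."""
    kanji_q = kanji_chars(q)
    kanji_qp = kanji_chars(q_prime)
    all_kanji = kanji_q | kanji_qp
    if not all_kanji:
        return 0.0  # assumption: no Kanji in either query
    return 1.0 - len(kanji_q & kanji_qp) / len(all_kanji)


# Two shared Kanji out of four unique Kanji; Katakana is ignored.
print(kanji_distance("東京 ホテル", "東京都 宿"))  # 0.5
```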
  • calculating a score comprises converting the one or more characters of the received query and a query selected from the candidate set to Roman characters and calculating a number of Roman characters the queries have in common.
  • calculating a score comprises identifying whether either the received query or a selected query from the candidate set contains a non-Roman character.
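
The two checks above can be illustrated together. The count of shared Roman characters is interpreted here as the length of the common prefix of the romanized queries, following the description of FIG. 8; that reading is an assumption. "Non-Roman" is approximated as "not ASCII", which would also flag full-width Latin letters. Both function names are hypothetical.

```python
from itertools import takewhile


def prefix_overlap(a: str, b: str) -> int:
    """Number of leading characters that two romanized queries share."""
    matching = takewhile(lambda pair: pair[0] == pair[1], zip(a, b))
    return sum(1 for _ in matching)


def has_non_roman(text: str) -> bool:
    """True if any character falls outside ASCII (approximation)."""
    return not text.isascii()


print(prefix_overlap("nooto pasokon", "nooto pc"))  # 7
print(has_non_roman("nooto pc"))                    # False
print(has_non_roman("ノート pc"))                    # True
```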
  • calculating a score comprises calculating a character edit distance between the received query and a selected query from the candidate set after converting the Japanese Kanji characters of each query to Japanese Kana characters and removing all non-Japanese characters from each query.
  • calculating a score comprises calculating a quotient of the frequency with which a selected query from the candidate set follows the received query in one or more query logs and the frequency of the received query in the one or more query logs.
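
The quotient described above can be read as an estimate of how often users who submit the received query go on to submit the candidate query next. A minimal sketch, assuming the two frequencies are available as plain dictionaries; the data layout, the function name, and the counts are hypothetical.

```python
def substitution_probability(follow_counts: dict, query_counts: dict,
                             q: str, q_prime: str) -> float:
    """Frequency of q_prime following q, divided by the frequency of q."""
    total = query_counts.get(q, 0)
    if total == 0:
        return 0.0  # assumption: an unseen query yields a score of zero
    return follow_counts.get((q, q_prime), 0) / total


# Hypothetical counts taken from a query log.
query_counts = {"intellectual property": 200}
follow_counts = {("intellectual property", "patent attorney"): 30}
print(substitution_probability(follow_counts, query_counts,
                               "intellectual property", "patent attorney"))
```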
  • the method further comprises selecting one or more of the queries from the candidate set for distribution.
  • the one or more queries selected from the candidate set for distribution comprise queries with scores exceeding a given threshold.
  • the one or more queries selected for distribution may be distributed.
  • the queries selected for distribution are embedded in one or more web pages.
  • the invention is also directed towards a system for identifying one or more queries related to a given query.
  • the system of the present invention comprises a search engine operative to receive a query written according to one or more writing systems of a language with multiple writing systems.
  • the search engine is operative to receive a query written according to one or more Japanese writing systems.
  • the search engine is further operative to identify a candidate set of one or more queries written according to one or more writing systems of the language with multiple writing systems associated with the query received.
  • the search engine is operative to identify a candidate set comprised of one or more queries related to the received query as indicated in one or more query logs.
  • a conversion component is operative to convert the received query and the one or more queries in the candidate set into one or more written formats. According to one embodiment of the invention, the conversion component is operative to convert a query into one or more written formats in accordance with one or more writing systems.
  • a similarity component is operative to calculate a score for the one or more queries in the candidate set indicating the similarity of the one or more queries with respect to the received query. The similarity component is operative to calculate a score indicating the similarity in meaning of a selected query from the candidate set with respect to the received query. According to one embodiment of the invention, the similarity component is operative to calculate a character edit distance between the received query and a selected query from the candidate set.
  • the similarity component is operative to calculate a difference between one ("1") and the quotient of the number of unique space-separated co-occurring words in the received query and a query selected from the candidate set, and the total number of unique space-separated words in both queries.
  • the similarity component is operative to identify whether a digit is unique to the received query or a selected query from the candidate set.
  • the similarity component is operative to calculate a difference between one ("1") and the quotient of the number of co-occurring Japanese Kanji characters in the received query and a query selected from the candidate set, and the total number of unique Japanese Kanji characters in both queries.
  • the similarity component is operative to calculate a number of characters the received query and a selected query from the candidate set have in common. According to yet another embodiment of the invention, the similarity component is operative to identify whether the received query or a selected query from the candidate set contains one or more characters of a given writing system. According to a further embodiment, the similarity component is operative to calculate a quotient of the frequency with which a selected query from the candidate set follows the received query in one or more query logs and the frequency of the received query in the query logs.
  • FIG. 1 is a block diagram presenting a system for identifying one or more related queries written in accordance with combinations of one or more writing systems of a language with multiple writing systems, according to one embodiment of the present invention.
  • FIG. 2 is a flow diagram illustrating one embodiment of a method for selecting one or more related queries written in accordance with a combination of one or more writing systems of a language with multiple writing systems, according to one embodiment of the present invention.
  • FIG. 3 is a flow diagram illustrating one embodiment of a method for calculating the character edit distance between two queries written in accordance with one or more writing systems of a language with multiple writing systems, according to one embodiment of the present invention.
  • FIG. 4 is a flow diagram illustrating another embodiment for calculating the character edit distance between two queries written in accordance with one or more writing systems of a language with multiple writing systems, according to one embodiment of the present invention.
  • FIG. 5 is a flow diagram illustrating one embodiment of a method for calculating the word edit distance between two queries written in accordance with one or more writing systems of a language with multiple writing systems, according to one embodiment of the present invention.
  • FIG. 6 is a flow diagram illustrating one embodiment of a method for identifying differences in the digits appearing in two queries written in accordance with one or more writing systems of a language with multiple writing systems, according to one embodiment of the present invention.
  • FIG. 7 is a flow diagram illustrating one embodiment of a method for calculating the character edit distance between two queries written in accordance with one or more writing systems of a language with multiple writing systems considering the characters of only one of the writing systems, according to one embodiment of the present invention.
  • FIG. 8 is a flow diagram illustrating one embodiment of a method for identifying the number of characters overlapping in the prefixes of two queries written in accordance with one or more writing systems of a language with multiple writing systems, according to one embodiment of the present invention.
  • FIG. 9 is a flow diagram illustrating one embodiment of a method for identifying whether two queries written in accordance with one or more writing systems of a language with multiple writing systems have non-Roman characters, according to one embodiment of the present invention.
  • FIG. 10 is a flow diagram illustrating one embodiment of a method for calculating the character edit distance between two queries written in accordance with one or more writing systems of a language with multiple writing systems after both queries have been converted to a given writing system, according to one embodiment of the present invention.
  • FIG. 11 is a flow diagram illustrating one embodiment of a method for calculating the query and phrase substitution probability of two queries written in accordance with one or more writing systems of a language with multiple writing systems, according to one embodiment of the present invention.
  • FIG. 1 presents a block diagram depicting one embodiment of a system for identifying one or more queries related to a given query written according to one or more writing systems of a language with multiple writing systems. According to the embodiment of FIG. 1, client devices 124a, 124b and 124c are communicatively coupled to a network 122, which may include a connection to one or more local and/or wide area networks, such as the Internet.
  • a client device 124a, 124b and 124c is a general purpose personal computer comprising a processor, transient and persistent storage devices, input/output subsystem and bus to provide a communications path between components comprising the general purpose personal computer.
  • a 3.5 GHz Pentium 4 personal computer with 512 MB of RAM, 40 GB of hard drive storage space and an Ethernet interface to a network.
  • client devices are considered to fall within the scope of the present invention including, but not limited to, hand held devices, set top terminals, mobile handsets, PDAs, etc.
  • Users of client devices 124a, 124b and 124c communicatively coupled to the network 122 may submit search queries, comprising one or more terms, to the search provider 100.
  • a search query submitted by a user via the network 122 to the search provider 100 may comprise one or more characters, terms, or phrases written according to one or more writing systems of a language with multiple writing systems.
  • users of client devices 124a, 124b, and 124c may formulate queries comprising Japanese Kanji characters, Japanese Katakana characters and JASCII characters.
  • users of client devices 124a, 124b and 124c may formulate queries comprising Japanese Romaji characters, Japanese Hiragana characters, and digits.
  • a user may submit the following query, written according to a combination of the Japanese Katakana, Hiragana, Kanji and ASCII writing systems: [query text garbled in source].
  • the one or more search queries submitted by users of client devices 124a, 124b, and 124c may be used by the search engine 107 at the search provider 100 to identify a candidate set of related queries.
  • the one or more queries comprising the candidate set of related queries may be maintained in one or more local or remote data stores 102 and 108, respectively, which are operative to maintain one or more queries that may be related to a given query.
  • data stores 102 and 108 are operative to maintain indices with entries identifying a set of queries related to one or more queries or terms.
  • the indices maintained by data stores 102 and 108 may be supplemented with human editorial information indicating terms or queries that are related. For example, an index entry in data stores 102 and 108 may comprise the query [query text garbled in source], written in accordance with the Japanese Katakana, Hiragana, Kanji, and ASCII writing systems
  • Data stores 102 and 108 may be implemented as databases or any other type of storage structures capable of providing for the retrieval and storage of one or more sets of queries, such as a database, CD-ROM, tape, digital storage library, etc.
  • the queries maintained in data stores 102 and 108 may comprise queries written according to one or more writing systems of a given language with multiple writing systems.
  • the queries maintained in data stores 102 and 108 may comprise queries written in accordance with the Japanese Kanji, Hiragana, Katakana, JASCII, and Romaji writing systems.
  • a candidate set of related queries identified by the search engine 107 comprises one or more sequential pairs of queries that co-occur with statistical significance in one or more query logs.
  • the search engine 107 may utilize query logs to identify a candidate set comprising one or more queries related to a query received from a client device 124a, 124b, and 124c.
  • the plurality of queries submitted by users to the search provider 100 which may be written according to one or more writing systems of a language with multiple writing systems, may be maintained in a query log component 106.
  • the query log component 106 may be implemented as a database or similar storage structure capable of providing for the storage of one or more queries written according to one or more writing systems.
  • the query log component 106 may maintain information identifying the frequency with which queries are submitted to the search provider 100. Similarly, the query log component 106 may maintain information identifying the frequency with which a given query follows a related query. For example, during a given session a user conducting a search may submit a query comprising the terms "intellectual property," written according to one or more writing systems of a language with multiple writing systems, e.g., Japanese. During the same session, the user may submit a query comprising the terms "patent attorney," written according to one or more Japanese writing systems. The query log component 106 may maintain information identifying the frequency with which the query "patent attorney" follows the query "intellectual property" during a given user's session.
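
The two kinds of frequency information described for the query log component can be sketched as a pair of counters built from per-session query sequences. The session representation (an ordered list of query strings per user session) and the function name are assumptions made for illustration.

```python
from collections import Counter


def count_queries(sessions):
    """Return (query_counts, follow_counts) for a list of sessions,
    where each session is an ordered list of query strings."""
    query_counts = Counter()   # how often each query was submitted
    follow_counts = Counter()  # how often query B directly followed query A
    for session in sessions:
        query_counts.update(session)
        follow_counts.update(zip(session, session[1:]))
    return query_counts, follow_counts


# Hypothetical sessions.
sessions = [
    ["intellectual property", "patent attorney"],
    ["intellectual property", "trademark"],
    ["patent attorney"],
]
query_counts, follow_counts = count_queries(sessions)
print(query_counts["intellectual property"])                        # 2
print(follow_counts[("intellectual property", "patent attorney")])  # 1
```

Counting only directly consecutive pairs within a session is one simple reading of "follows"; a production log would also need session boundaries and deduplication rules that the text does not specify.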
  • the search engine 107 may utilize the query logs maintained by the query log component 106 to identify a candidate set comprising one or more queries that are statistically significantly related to a query received from a given client device 124a, 124b, and 124c.
  • the one or more queries identified as related to a given query may be used to supplement or generate a candidate set of related queries.
  • the candidate set of related queries may comprise queries written according to one or more writing systems of a given language with multiple writing systems, such as Japanese. Exemplary methods for identifying one or more queries related to a given query using query logs are described in commonly owned U.S. Patent Application Serial No.
  • a similarity component 104 uses the candidate set identified by the search engine 107 to calculate a similarity score for the one or more queries in the candidate set of related queries.
  • the similarity component 104 is operative to select a given query Q' from the candidate set of related queries and calculate a similarity score for Q', indicating the strength of the similarity in meaning of Q' with respect to a given query Q received from a given client device 124a, 124b and 124c.
  • the similarity component 104 is operative to calculate a similarity score for each of the one or more queries in the candidate set of related queries identified by the search engine 107 according to methods described herein.
  • the similarity component 104 may utilize a conversion component 110 to calculate a similarity score for each query Q' in the candidate set of related queries identified by the search engine 107.
  • the conversion component 110 converts a given query into one or more written formats.
  • the one or more written formats of a given query Q' generated by the conversion component 110 may be delivered to the similarity component 104 to facilitate the calculation of a similarity score.
  • the similarity component 104 may perform numerous comparisons with respect to a given query Q, received from a user, and a related query Q', selected from a candidate set of related queries, in order to calculate an accurate similarity score.
  • the one or more queries in the candidate set of related queries may be written according to one or more writing systems of a given language with multiple writing systems.
  • a query received from a given client device 124a, 124b, and 124c may be written according to one or more systems of a given language with multiple writing systems.
  • the one or more comparisons performed by the similarity component 104 may require a query received from a user, Q, and a given query selected from the candidate set of related queries, Q', to be expressed in accordance with a particular writing system.
  • the similarity component 104 may require that the one or more JASCII characters of a given query Q and a related query Q' be converted to ASCII characters in order to compare the two queries.
  • the similarity component 104 may deliver a given query to the conversion component 110.
  • the conversion component 110 is operative to identify the language and writing system associated with a given query and convert the query into one or more alternate written formats.
  • the candidate set identified by the search engine 107 may comprise queries written according to a variety of writing systems of a given language with multiple writing systems, such as the Japanese Kanji, Kana, JASCII, and Romaji writing systems.
  • the conversion component 110 is operative to identify a query as written according to one or more Japanese writing systems and convert the query into one or more alternate writing systems.
  • the conversion component 110 is operative to identify a query as written according to the Japanese Katakana writing system and convert the query in accordance with the Japanese Romaji writing system. Similarly, the conversion component 110 is operative to identify a query as comprising one or more JASCII characters and convert the one or more JASCII characters to ASCII characters in order to facilitate the calculation of a similarity score by the similarity component 104.
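
Two simple conversions of this kind can be sketched with the Python standard library. The sketch assumes that "JASCII" refers to the full-width forms of ASCII characters, which Unicode NFKC normalization folds to plain ASCII. Katakana-to-Romaji conversion needs a transliteration table and is not shown; a simpler Katakana-to-Hiragana conversion, based on the fixed 0x60 offset between the two Unicode blocks, is substituted for illustration. The function names are hypothetical.

```python
import unicodedata


def fullwidth_to_ascii(text: str) -> str:
    """Fold full-width Latin letters and digits to ASCII via NFKC.
    Note: NFKC also rewrites other compatibility characters."""
    return unicodedata.normalize("NFKC", text)


def katakana_to_hiragana(text: str) -> str:
    """Shift Katakana U+30A1..U+30F6 down by 0x60 to the Hiragana block;
    other characters (including the long-vowel mark) pass through."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


print(fullwidth_to_ascii("ＰＣ１２３"))   # PC123
print(katakana_to_hiragana("ノート"))     # のーと
```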
  • the similarity scores calculated by the similarity component 104 for the one or more queries in the candidate set of related queries are used by a distribution component 116 to select one or more queries from the candidate set for distribution.
  • the selection of queries based upon similarity scores allows for the selection of queries that are the most similar in meaning with respect to a given query Q.
  • the distribution component 116 may select one or more queries from the candidate set of related queries with similarity scores exceeding a given threshold.
  • the distribution component may select the N queries from the candidate set with the greatest similarity scores.
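
Both selection strategies (scores above a threshold, or the N highest scores) are short to express. A minimal sketch, assuming the similarity scores are held in a dictionary keyed by candidate query; the names and score values are hypothetical.

```python
import heapq


def select_above_threshold(scores: dict, threshold: float) -> list:
    """Candidate queries whose similarity score exceeds the threshold."""
    return [query for query, score in scores.items() if score > threshold]


def select_top_n(scores: dict, n: int) -> list:
    """The n candidate queries with the greatest similarity scores."""
    return heapq.nlargest(n, scores, key=scores.get)


# Hypothetical similarity scores for three candidate queries.
scores = {"nooto pc": 0.9, "dentaku": 0.4, "pasokon": 0.7}
print(select_above_threshold(scores, 0.5))  # ['nooto pc', 'pasokon']
print(select_top_n(scores, 2))              # ['nooto pc', 'pasokon']
```

A threshold yields a variable number of suggestions of at least a given quality, while top-N yields a fixed number regardless of quality; the two can also be combined.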
  • the distribution component 116 may distribute the one or more queries selected from the candidate set.
  • the distribution component 116 displays the queries selected from the candidate set to a user via the network 122 as "suggested alternate queries" or "queries similar in meaning."
  • the distribution component 116 is operative to deliver the one or more queries selected to the search engine 107, which may embed the selected queries in a search result web page that may be viewed by a given user of a client device 124a, 124b, and 124c communicatively coupled to the network 122.
  • the similarity scores calculated by the similarity component 104 for the one or more queries in the candidate set may further be used to select one or more items of content, including advertisements, for distribution in response to a given request.
  • advertisements may be maintained in the abovementioned data stores 102 and 108, or in one or more disparate data stores (not illustrated).
  • the one or more local 102, remote 108, or disparate data stores are operative to maintain one or more advertisements and associated bids for terms corresponding to the advertisements. For example, a given advertiser may wish to display a given advertisement for notebook computers.
  • the advertiser may thus bid on the terms "notebook computer" and identify the advertisement that is to be displayed in response to a query comprising the terms "notebook computer."
  • the search engine 107 may search local and remote data stores 102 and 108, or one or more disparate data stores, to determine whether one or more advertisers have provided bids for the one or more terms comprising the query received. If one or more bids for the terms comprising the query are identified, the advertisements associated with the bids for the one or more terms may be retrieved and displayed to the user on the user's client device 124a, 124b and 124c using the distribution component 116.
  • the advertiser associated with the advertisement selected may be charged a monetary sum in accordance with the advertiser's bid.
  • An advertiser may choose to bid upon terms written according to only a single writing system of a language with multiple writing systems. For example, an advertiser may choose to bid upon terms written according to only the Japanese Hiragana writing system.
  • the one or more search queries submitted by users of client devices 124a, 124b, and 124c may comprise terms and phrases written according to one or more writing systems.
  • the search engine 107 may thus utilize the queries with similarity scores exceeding a given threshold to expand the breadth of advertisements retrieved in response to a given query.
  • the search engine 107 identifies the one or more advertisements responsive to the terms comprising the one or more queries with similarity scores exceeding a given threshold.
  • the one or more advertisements identified as responsive to the terms comprising the queries with similarity scores exceeding a given threshold may be selected for distribution to one or more client devices 124a, 124b, and 124c.
  • a user of a client device 124a, 124b and 124c may formulate a search query Q comprised of Japanese terms written according to both the Japanese Kanji and Romaji writing systems.
  • the user may submit the query to the search provider 100 via the network 122.
  • the search engine 107 may determine that no advertisers have provided bids for the Kanji and Romaji terms utilized by the user.
  • the search engine 107 may determine that displaying the advertisements corresponding to the bids associated with the Kanji and Romaji terms utilized by the user will result in little if any revenue.
  • the search engine 107 may utilize the terms comprising the one or more queries selected from the candidate set with similarity scores exceeding a given threshold to identify one or more terms with associated bids.
  • the search engine 107 may utilize the terms comprising the one or more queries selected from the candidate set with similarity scores exceeding a given threshold to identify one or more terms with bids exceeding a given threshold.
  • the search engine 107 may thereafter utilize the one or more terms with associated bids, or the one or more terms with associated bids exceeding a given threshold, to select one or more advertisements responsive to the search query Q formulated by the user.
  • a given query Q' selected from the candidate set with a similarity score exceeding a given threshold comprises Hiragana terms
  • the abovementioned query Q formulated by the user comprises Kanji and Romaji terms.
  • the search engine may utilize the one or more Hiragana terms comprising query Q' to determine whether one or more advertisers have bid upon the Hiragana terms comprising query Q'.
  • the search engine may determine whether one or more advertisers have provided bids for the one or more Hiragana terms comprising query Q' exceeding a given threshold.
  • the search engine 107 may retrieve the one or more advertisements with associated bids for the terms comprising query Q' and deliver the one or more advertisements to the distribution component.
  • the search engine 107 retrieves the one or more advertisements with the greatest associated bids for the one or more terms comprising query Q'.
  • the distribution component 116 may thereafter distribute the one or more advertisements to the user that submitted the query Q.
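By way of illustration only, the bid lookup described above might be sketched as follows. The function name `select_advertisements`, the shape of the `bids` mapping (query text to a list of advertisement identifier and bid amount pairs), and the `limit` parameter are assumptions made for this sketch and do not appear in the specification.

```python
def select_advertisements(related_queries, bids, score_threshold,
                          bid_threshold=0.0, limit=3):
    """Pick advertisements via related queries Q' instead of the raw query Q.

    related_queries: list of (query_text, similarity_score) pairs.
    bids: dict mapping query_text -> list of (ad_id, bid_amount) pairs.
    All names and data shapes here are hypothetical.
    """
    eligible = []
    for query_text, score in related_queries:
        if score <= score_threshold:
            continue  # only queries exceeding the similarity threshold count
        for ad_id, amount in bids.get(query_text, []):
            if amount > bid_threshold:
                eligible.append((amount, ad_id))
    eligible.sort(reverse=True)  # greatest associated bids first
    return [ad_id for _, ad_id in eligible[:limit]]
```

Under these assumptions, a Kanji and Romaji query with no bids of its own may still retrieve advertisements through a Hiragana query Q' on which advertisers have bid.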
  • the search provider 100 system illustrated in FIG. 1 is not limited to receiving and calculating similarity scores for queries, and may further be used to calculate similarity scores for one or more terms comprising one or more strings of text.
  • Users of client devices 124a, 124b, and 124c may deliver to the search provider 100 one or more strings of text comprising one or more terms, including but not limited to, phrases, sentences, paragraphs, and documents written according to one or more writing systems of a language with multiple writing systems. Accordingly, the search provider 100 records a log of these one or more strings of text in one or more log files.
  • the search provider 100 is operative to identify a candidate set comprising one or more items from its log files, wherein a given item comprises one or more sets of terms related to the one or more terms delivered by a given user of a client device 124a, 124b, and 124c.
  • a given item in the candidate set may comprise a phrase or a sentence.
  • a given item in the candidate set may comprise a paragraph or an entire document.
  • the search provider may calculate a similarity score for the one or more items in the candidate set indicating the strength of the similarity in meaning of an item with respect to the one or more terms received from a client device 124a, 124b, and 124c.
  • FIG. 2 illustrates one embodiment of a method for selecting one or more queries Q' from a candidate set that are related in meaning to a given query Q, wherein query Q and Q' are written in accordance with one or more writing systems of a language with multiple writing systems.
  • a search query is received from a given user, step 205.
  • the query may be received from a client device communicatively coupled to a network, such as the Internet, and comprise one or more terms or phrases written according to combinations of one or more writing systems of a language with multiple writing systems.
  • the query received from a user may comprise Japanese terms written in accordance with the Kanji, Katakana, and Hiragana writing systems.
  • a candidate set comprised of queries related to a given query Q formulated by a user is identified, step 210.
  • the candidate set may be comprised of queries written according to one or more writing systems of the language associated with the user's query.
  • a given query Q may comprise terms written in accordance with the Japanese Katakana writing system, such as the query [Katakana text not recoverable from source]
  • the candidate set of related queries may thus comprise one or
  • the candidate set of queries related to the abovementioned Hiragana query [text not recoverable from source] may comprise the Romaji query rakuten, the Kanji query [text not recoverable from source], the
  • the candidate set of queries related to a given query Q may be generated using one or more query logs.
  • query logs may identify one or more queries formulated by a user during a given query session. For example, during a given query session, a user may formulate a query comprising terms written according to the Japanese Hiragana and Kanji writing systems. During the same query session, a user may also formulate a query comprising terms written according to the Japanese Katakana and Romaji writing systems. An analysis may be performed to determine whether the two queries co-occur in the one or more query logs with statistical significance. According to one embodiment of the invention, a statistical significance threshold value may be used to select the one or more queries that are the most related to a given query Q as indicated by one or more query logs.
  • the candidate set may be generated with the one or more queries identified as related to the given query with statistical significance, or with statistical significance exceeding a given threshold value as indicated by one or more query logs.
  • the one or more queries comprising the candidate set of related queries may be selected according to methods for determining statistically significantly related queries using query logs described in the above-identified applications incorporated by reference in their entirety.
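By way of illustration only, candidate generation from session logs might be sketched as below. The specification relies on statistical significance tests described in the applications incorporated by reference; this sketch substitutes a plain minimum co-occurrence count as a crude stand-in for such a test. The function name, the representation of a log as a list of sessions, and the `min_cooccurrences` parameter are all assumptions.

```python
from collections import Counter

def candidate_set(q, sessions, min_cooccurrences=2):
    """Return queries that share a session with q at least min_cooccurrences times.

    sessions: list of sessions, each a list of query strings in submission order.
    The count cutoff is a simplification, not the significance test of the source.
    """
    counts = Counter()
    for session in sessions:
        unique_queries = set(session)
        if q in unique_queries:
            for other in unique_queries - {q}:
                counts[other] += 1  # one count per session, not per repetition
    return {query for query, n in counts.items() if n >= min_cooccurrences}
```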
  • a given query Q' is selected from the candidate set of related queries, step 215.
  • a similarity score is calculated for the query Q' selected, step 220.
  • the similarity score calculated for a given query Q' provides a numerical value indicating the strength of the similarity of the meaning of query Q' with respect to the meaning of a given query Q, written according to one or more writing systems of a language with multiple writing systems.
  • Table A illustrates one embodiment of an equation that may be used to calculate a similarity score for a given query Q'.
  • the equation presented in Table A may be used to calculate a score indicating the strength of the similarity in meaning of a given query Q' with respect to a given query Q, which may be written according to one or more Japanese writing systems, including, but not limited to, Kanji, Kana, JASCII, Katakana, Romaji, and Hiragana.
  • $\text{Similarity score}(Q') = 1.47551 + \text{levr}(Q,Q') \times (-1.68821) + \text{levrs}(Q,Q') \times 2.48700 + \text{wordr}(Q,Q') \times 0.44366 + \text{digit}(Q,Q') \times 0.75388 + \text{kanjid}(Q,Q') \times 0.22496 + \text{opr}(Q,Q') \times (-0.40083) + \text{Japanese}(Q,Q') \times 0.09368 + \text{levk}(Q,Q') \times (-0.32574) + \text{p12min}(Q,Q') \times (-\ldots)$
  • Q represents a given query written according to one or more Japanese writing systems.
  • Q' represents a query selected from a candidate set of queries related to query Q.
  • Levr is a function for calculating the character edit distance between Q and Q' after converting all Japanese characters to Roman characters.
  • Levrs is a function for calculating the character edit distance between Q and Q' after converting all Japanese characters to Roman characters and removing spaces.
  • Wordr is the word edit distance between Q and Q' after converting all Japanese characters to Roman characters.
  • Digit is a function for identifying whether Q contains any digits not appearing in Q' and vice versa.
  • Kanjid is a function for determining whether either Q or Q' contains Kanji characters, and if so, identifying a Kanji disagreement between Q and Q'.
  • Opr is a function for calculating the number of characters Q and Q' have in common starting from the leftmost character of each query and continuing until the first character disagreement, after all Japanese characters in each query have been converted to Roman characters.
  • Levk is a function for calculating the character edit distance between Q and Q' after all Kanji characters have been converted to Kana characters and all non-Japanese characters have been removed.
  • P12min is a function for calculating the query substitution probability of query Q' following query Q in a log of user query sessions. Embodiments of the functions utilized by the similarity score function illustrated in Table A are illustrated in FIG. 3 through FIG. 11.
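By way of illustration only, the similarity score of Table A may be read as an intercept plus a weighted sum of the feature values. The sketch below transcribes the coefficients that are legible in Table A. The coefficient for the p12min term is cut off in the text, so it is left as a parameter rather than guessed. The dictionary keys and the function name are assumptions made for this sketch.

```python
# Coefficients transcribed from Table A. The p12min coefficient is truncated
# in the source text, so the caller must supply it explicitly.
INTERCEPT = 1.47551
COEFFICIENTS = {
    "levr": -1.68821,
    "levrs": 2.48700,
    "wordr": 0.44366,
    "digit": 0.75388,
    "kanjid": 0.22496,
    "opr": -0.40083,
    "japanese": 0.09368,
    "levk": -0.32574,
}

def similarity_score(features, p12min_coefficient):
    """Weighted sum of feature values for one candidate query Q'.

    features: dict holding a numeric value for every key in COEFFICIENTS
    plus a "p12min" entry. Key names are hypothetical.
    """
    score = INTERCEPT
    for name, weight in COEFFICIENTS.items():
        score += weight * features[name]
    score += p12min_coefficient * features["p12min"]
    return score
```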
  • a check is performed to determine whether a similarity score has been calculated for the one or more queries in the candidate set, step 225. If one or more queries in the candidate set do not have an associated similarity score, an additional query Q' is selected from the candidate set, step 215. Alternatively, if a similarity score has been calculated for the one or more queries in the candidate set, a given query Q' is selected from the candidate set, step 230. A check is performed to determine whether the similarity score associated with the query Q' selected from the candidate set exceeds a given similarity score threshold, step 235.
  • the similarity score threshold comprises a numeric value that may be used to perform a comparison with the similarity score associated with a given query Q'.
  • a similarity score indicates the strength of the similarity in meaning of a given query Q' with respect to a query Q
  • the use of a similarity score threshold facilitates the selection of one or more queries from the candidate set that are the most similar in meaning with respect to the query Q.
  • the query Q' is added to a distribution set, step 245.
  • the distribution set comprises the one or more queries selected from the candidate set that have similarity scores exceeding the similarity score threshold. If the similarity score associated with a given query Q' does not exceed the similarity score threshold, the query Q' is not added to the distribution set, step 240.
  • a check is performed to determine whether there are additional queries in the candidate set that require analysis, step 250. If one or more queries in the candidate set require analysis, an additional query Q' is selected from the candidate set, step 230. Alternatively, after all queries in the candidate set have been analyzed, and the distribution set has been populated with the one or more queries exceeding the similarity score threshold, the one or more queries in the distribution set are distributed, step 255.
  • the one or more queries in the distribution set of queries exceeding the similarity score threshold may be delivered to the user who submitted the query Q.
  • the one or more queries in the distribution set are displayed to the user in a results web page.
  • a user may be presented with a web page comprising results, such as links to content items that are responsive to the query Q, as well as the one or more Q' queries comprising the distribution set that are most similar in meaning with respect to the query Q.
  • the one or more queries in the distribution set delivered to a given user may be displayed in a ranked list according to similarity score to indicate to the user the relative strength of the similarity in meaning of a given query Q' with respect to the query Q.
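By way of illustration only, the thresholding and ranking of steps 230 through 255 might be sketched as follows. The function name and the representation of the candidate set as (query, score) pairs are assumptions made for this sketch.

```python
def build_distribution_set(scored_candidates, threshold):
    """Keep candidates whose score exceeds the threshold, best first.

    scored_candidates: iterable of (query_text, similarity_score) pairs.
    """
    distribution = [(query, score) for query, score in scored_candidates
                    if score > threshold]
    # Ranked list ordered by similarity score, strongest similarity first.
    distribution.sort(key=lambda pair: pair[1], reverse=True)
    return distribution
```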
  • FIGs. 3 through 11 illustrate embodiments of the functions presented in Table A that may be used to calculate a similarity score for a given query Q' selected from a candidate set of queries.
  • the plurality of functions illustrated in Table A, and further described in FIGs. 3 through 11, may be used to calculate a similarity score indicating the strength of the similarity in meaning of a given query Q' with respect to a query Q, written according to one or more Japanese writing systems.
  • Those of skill in the art recognize, however, that the embodiments of the functions illustrated in FIGs. 3 through 11 are exemplary and not intended to be limited to the Japanese language and writing systems, and may be modified so as to provide for the calculation of a similarity score for other languages with multiple writing systems.
  • FIGs. 3 through 11 are not limited to calculating similarity scores for a candidate set comprising one or more queries related to a given query and may be used to calculate similarity scores for a candidate set of queries comprising one or more queries selected according to a plurality of techniques. Additionally, those of skill in the art recognize that the functions illustrated in FIGs. 3 through 11 are not limited to calculating similarity scores for a candidate set comprising one or more queries and may be modified so as to calculate similarity scores for one or more sets of terms, including but not limited to, phrases, sentences, paragraphs, and documents.
  • FIG. 3 illustrates one embodiment of a method for calculating the character edit distance between a given query Q, written according to one or more Japanese writing systems, and a query Q' selected from a candidate set of queries.
  • the method presented in FIG. 3 illustrates one embodiment of the levr function utilized by the similarity score function illustrated in Table A.
  • the one or more characters comprising a query Q which may be written according to one or more Japanese writing systems, such as Kanji, Katakana, Hiragana, etc., are converted to Roman characters, step 305.
  • a given query Q' is selected from a candidate set comprised of one or more queries, step 310.
  • the query Q' selected from the candidate set may be written according to one or more writing systems of the language associated with query Q.
  • Q' may be written according to the same writing system as the query Q or one or more alternate Japanese writing systems, such as the Japanese Romaji writing system, the Japanese Kana writing system, etc.
  • a check is performed to determine whether the characters comprising Q' are in Roman character form, step 315. If query Q' is not in Roman character form, the one or more characters comprising Q' are converted into Roman characters, step 320. If the one or more terms comprising Q' are already in Roman character form, or after all the characters in Q' have been converted to Roman character form, a calculation is performed to identify the character edit distance between query Q and query Q', step 325.
  • the character edit distance value may be provided to the similarity score function illustrated in Table A in order to calculate a similarity score for Q'.
  • FIG. 4 illustrates one embodiment of a method for calculating the character edit distance between a given query Q, written according to one or more Japanese writing systems, and a query Q' selected from a candidate set of queries.
  • the embodiment illustrated in FIG. 4 provides one embodiment of the levrs function used by the similarity score function illustrated in Table A.
  • query Q written according to one or more Japanese writing systems, such as Kanji, Katakana, or Hiragana, is converted to Roman character form, step 405. Thereafter, all space characters appearing in the Roman character form query Q are removed, step 408.
  • a given query Q may comprise the Kanji terms
  • query Q may comprise the terms "densha otoko," and after removing spaces query Q may comprise the characters "denshaotoko."
  • a given query Q' is selected from a candidate set comprising one or more queries, step 410.
  • a check is performed to determine whether Q' is in Roman character form, step 415. If query Q' is not in Roman character form, the one or more characters comprising query Q' are converted to Roman characters, step 420. If the characters comprising query Q' are already in Roman character form, or after the characters comprising query Q' have been converted to Roman character form, all spaces within query Q' are removed, step 425.
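By way of illustration only, the two romanized edit distances defined in Table A might be sketched as below, using the standard Levenshtein distance as the character edit distance. The specification does not state which edit distance variant is used or whether it is normalized, so that choice is an assumption. Conversion to Roman characters requires a reading dictionary or similar resource, so it is represented by a caller-supplied `to_roman` function, which is hypothetical and defaults to leaving the text unchanged.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def levr(q, q_prime, to_roman=lambda text: text):
    """Edit distance after romanizing both queries."""
    return edit_distance(to_roman(q), to_roman(q_prime))

def levrs(q, q_prime, to_roman=lambda text: text):
    """Edit distance after romanizing both queries and removing spaces."""
    def strip(text):
        return to_roman(text).replace(" ", "")
    return edit_distance(strip(q), strip(q_prime))
```

With the romanized example above, "densha otoko" and "denshaotoko" differ by one character under `levr` and by none under `levrs`.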
  • FIG. 5 illustrates one embodiment of the wordr function illustrated in Table A.
  • the embodiment of the wordr function illustrated in FIG. 5 provides for the calculation of the word edit distance between a given query Q, written according to one or more Japanese writing systems, and a query Q', selected from a candidate set of queries.
  • the word edit distance between a given query Q and a query Q' is the difference between the value one ("1") and the quotient of the number of unique space-separated co-occurring words in Q and Q' and the total number of unique space-separated words in both Q and Q'.
  • the characters comprising a given query Q are converted to Roman character form, step 505. Thereafter, a given query Q' is selected from a candidate set of queries, step 506. A check is performed to determine whether the query Q' is in Roman character form, step 508. If query Q' is not in Roman character form, the characters comprising query Q' are converted to Roman characters, step 510. If the characters comprising the query Q' are already in Roman character form, or after the characters comprising Q' have been converted to Roman character form, the number of unique space-separated co-occurring words in Q and Q' is identified, step 515.
  • the quotient of the number of unique space-separated co-occurring words in Q and Q' and the total number of unique space-separated words in both Q and Q' is calculated, step 520.
  • the number of unique space-separated co-occurring words comprises the number of unique words that appear in both a given query Q and a query Q'.
  • the total number of unique space-separated words in both Q and Q' comprises the number of distinct space-separated words that appear in query Q, in query Q', or in both, with each word counted once.
  • the difference between the value one ("1") and the calculated quotient is calculated, step 525, and assigned to a 'wordr' register, step 530.
  • the 'wordr' register comprises a memory device for storing a given numeric value.
  • the value assigned to the 'wordr' register may be used by the similarity score function illustrated in Table A to calculate a similarity score for the query Q'.
  • a given query Q, in Roman character form, may be comprised of the terms "kuruma kemuri."
  • a given query Q', in Roman character form, may be comprised of the terms "sora kemuri."
  • the number of unique space-separated co-occurring words in Q and Q' is one ("1"), namely, the word "kemuri," wherein the total number of unique space-separated words in both Q and Q' is three ("3"), namely the words "kuruma," "sora," and "kemuri."
  • the quotient of the number of unique space-separated co-occurring words in Q and Q' and the total number of unique space-separated words in both Q and Q' is 1/3.
  • the difference between one ("1") and the calculated quotient is 2/3.
  • the value 2/3 may be assigned to the 'wordr' register and used by the similarity score function illustrated in Table A to calculate a similarity score for query Q'.
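By way of illustration only, the wordr calculation may be sketched with set operations, treating the total number of unique words as the words appearing in either query, which is the reading the three-word example above implies. The function name mirrors Table A. Returning an exact fraction and returning zero for two empty queries are assumptions made for this sketch.

```python
from fractions import Fraction

def wordr(q_roman, q_prime_roman):
    """One minus (shared unique words / all unique words), on romanized text."""
    words_q = set(q_roman.split())
    words_q_prime = set(q_prime_roman.split())
    all_words = words_q | words_q_prime
    if not all_words:
        return Fraction(0)  # assumption: two empty queries count as identical
    shared = words_q & words_q_prime
    return 1 - Fraction(len(shared), len(all_words))

print(wordr("kuruma kemuri", "sora kemuri"))  # prints 2/3
```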
  • FIG. 6 illustrates one embodiment of a method for determining whether a digit is unique to a given query Q, written according to one or more Japanese writing systems, in comparison with a query Q' selected from a candidate set of queries.
  • the embodiment presented in FIG. 6 provides one embodiment of the digit function used by the similarity score function illustrated in Table A.
  • a given query Q' is selected from a candidate set comprised of queries written according to one or more writing systems, step 605. A check is performed to determine whether a digit in a given query Q does not appear in query Q', step 610.
  • a given query Q may
  • a given query Q' may contain the Japanese Kanji digits
  • a given query Q may comprise the Japanese Kanji characters and the Arabic numerals [Kanji text not recoverable from source] 2005, and a given query Q' may
  • the check performed at step 610 would determine that the Arabic numeral 5 is unique to query Q, as it does not appear in query Q'.
  • a 'digit' register is set to the value one ("1"), indicating that query Q contains a digit not in query Q', step 620.
  • the 'digit' register comprises a memory device for storing a given numeric value.
  • If Q' contains each of the one or more digits appearing in query Q, an additional check is performed to determine whether a digit in query Q' does not appear in query Q, step 615. If query Q' contains a digit which does not appear in query Q, the abovementioned 'digit' register is set to the value one ("1"), indicating that query Q' contains a digit unique to Q', step 620. Alternatively, if query Q contains each of the one or more digits in Q', the 'digit' register is set to zero ("0"), step 625, indicating that the one or more digits in query Q' appear in query Q, and vice versa.
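By way of illustration only, the digit check of FIG. 6 might be sketched as a comparison of the sets of digits found in each query. This sketch uses `unicodedata.digit` so that full-width digits compare equal to their ASCII counterparts. Kanji numerals are not treated as digits here, which is a simplification and an assumption of this sketch.

```python
import unicodedata

def digits_in(text):
    """Set of digit values (0-9) found in text, across ASCII and full-width forms."""
    values = (unicodedata.digit(ch, None) for ch in text)
    return {value for value in values if value is not None}

def digit(q, q_prime):
    """1 if either query contains a digit the other lacks, else 0."""
    return 1 if digits_in(q) != digits_in(q_prime) else 0
```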
  • FIG. 7 presents one embodiment of the kanjid function used by the similarity score function illustrated in Table A.
  • a given query Q which may be written according to one or more Japanese writing systems, is received, step 705.
  • a check is performed to determine whether query Q contains one or more Japanese Kanji characters, step 710. If query Q does not contain any Kanji characters, a 'kanjid' register is set to zero ("0"), step 708, wherein the 'kanjid' register may comprise a memory device for storing a given numeric value.
  • a query Q' is selected from a candidate set of queries, step 715.
  • Q is comprised of Kanji characters and if after removing non-Kanji characters query
  • Q' is comprised of Kanji characters [Kanji text not recoverable from source], the number of unique co-occurring Kanji characters
  • the total number of unique Kanji characters in both Q and Q' is thereafter identified, step 727.
  • the 'kanjid' register is set to the value of the difference between one ("1") and the calculated quotient, step 735.
  • the 'kanjid' register value may be used by the similarity score function illustrated in Table A to calculate a similarity score for Q'.
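By way of illustration only, the kanjid calculation might be sketched as below: zero when query Q has no Kanji, and otherwise one minus the quotient of shared unique Kanji over all unique Kanji in the two queries. Detecting Kanji by the basic CJK Unified Ideographs block (U+4E00 through U+9FFF) is an assumption of this sketch, and the example strings are chosen for illustration rather than taken from the specification.

```python
from fractions import Fraction

def kanji_chars(text):
    """Unique characters of text in the basic CJK Unified Ideographs block."""
    return {ch for ch in text if "\u4e00" <= ch <= "\u9fff"}

def kanjid(q, q_prime):
    """Kanji disagreement between Q and Q', from 0 (none) to 1 (no overlap)."""
    kanji_q = kanji_chars(q)
    if not kanji_q:
        return Fraction(0)  # Q has no Kanji, so the register is set to zero
    kanji_q_prime = kanji_chars(q_prime)
    shared = kanji_q & kanji_q_prime
    total = kanji_q | kanji_q_prime
    return 1 - Fraction(len(shared), len(total))
```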
  • FIG. 8 illustrates one embodiment of a method for identifying the number of characters overlapping in the prefixes of a given query Q, written according to one or more Japanese writing systems, and a query Q' selected from a candidate set of queries, starting with a comparison of the leftmost characters of each query and continuing until the first character disagreement.
  • the method presented in FIG. 8 illustrates one embodiment of the opr function utilized by the similarity score function illustrated in Table A.
  • a given query Q written according to one or more Japanese writing systems, is converted to Roman character form, step 805.
  • a query Q' is selected from a candidate set of queries, step 810.
  • a check is performed to determine whether the one or more characters comprising query Q' are in Roman character form, step 815. If the one or more characters comprising query Q' are not in Roman character form, the characters are converted to Roman characters, step 820. If the characters comprising Q' are already in Roman character form, or after the one or more characters comprising Q' have been converted to Roman character form, the first Roman characters of query Q and query Q' are selected, step 825. A check is performed to determine whether the first character selected from query Q and the first character selected from query Q' match, step 835.
  • If the first characters selected from Q and Q' do not match, processing terminates, step 830.
  • a character match count register is incremented, step 850, indicating that a character match for query Q and query Q' was identified.
  • the character match count register is initialized with the value zero ("0") and incremented as characters from query Q and query Q' are identified as matching.
  • The next characters from Q and Q' are selected, step 840, and a check is performed to determine whether the next characters match, step 835. If the characters selected from Q and Q' do not match, the character match count register is not incremented and processing ends, step 830.
  • the value in the character match count register will indicate the number of characters that match in Q and Q'.
  • the value in the character match count register is utilized by the similarity score function illustrated in Table A to calculate a similarity score for query Q'.
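By way of illustration only, the prefix overlap count of FIG. 8 might be sketched as follows, assuming both queries have already been converted to Roman character form. The function name mirrors Table A, and the example strings are illustrative.

```python
def opr(q_roman, q_prime_roman):
    """Number of leading characters shared by two romanized queries."""
    match_count = 0  # the character match count register starts at zero
    for char_q, char_q_prime in zip(q_roman, q_prime_roman):
        if char_q != char_q_prime:
            break  # first disagreement ends processing
        match_count += 1
    return match_count
```

Because `zip` stops at the end of the shorter string, the count never exceeds the length of the shorter query.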
  • FIG. 9 illustrates one embodiment of a method for identifying whether a given query Q, written according to one or more Japanese writing systems, or a query Q', selected from a candidate set of queries, contains non-Roman characters.
  • the embodiment presented in FIG. 9 illustrates the Japanese function used by the similarity score function illustrated in Table A.
  • a given query Q, written according to one or more Japanese writing systems is received, step 905.
  • a check is performed to determine whether query Q contains one or more non-Roman characters, step 910. If query Q contains one or more non-Roman characters, a 'Japanese' register is set to the value one ("1"), step 908.
  • the 'Japanese' register comprises a memory device for storing a given numeric value.
  • a query Q' is selected from a candidate set comprising one or more queries, step 915.
  • a check is performed to determine whether query Q' contains one or more non-Roman characters, step 920. If query Q' contains one or more non-Roman characters, the 'Japanese' register is set to the value one ("1"), step 908. Alternatively, if query Q' contains only Roman characters, the 'Japanese' register is set to the value zero ("0"), step 922, and processing is thereafter terminated, step 925.
  • the value maintained in the 'Japanese' register may be utilized by the similarity score function illustrated in Table A to calculate a similarity score for query Q'.
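By way of illustration only, the Japanese indicator of FIG. 9 might be sketched as below. Treating every character outside the ASCII range as non-Roman is an assumption of this sketch. The specification does not define the boundary, and full-width Roman letters, for example, would be counted as non-Roman under this rule.

```python
def has_non_roman(text):
    """True if any character falls outside the ASCII range (an assumed proxy)."""
    return any(ord(ch) > 127 for ch in text)

def japanese(q, q_prime):
    """1 if either query contains a non-Roman character, else 0."""
    return 1 if has_non_roman(q) or has_non_roman(q_prime) else 0
```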
  • FIG. 10 illustrates one embodiment of a method for determining the character edit distance between a given query Q and a query Q' after all Kanji characters have been converted to Kana characters and all non-Japanese characters have been removed from each respective query.
  • the method presented in FIG. 10 illustrates one embodiment of the levk function utilized by the similarity score function illustrated in Table A.
  • a given query Q' is selected from a candidate set of queries, step 1005.
  • a check is performed to determine whether query Q' or a given query Q, written according to one or more Japanese writing systems, contains one or more Kanji characters, step 1010. If either query Q or query Q' contains one or more Kanji characters, the Kanji characters in each respective query are converted to Kana characters, step 1015.
  • query Q may
  • query Q may comprise the characters
  • non-Japanese characters comprise characters not written according to one or more Japanese writing systems. For example, if query Q includes
  • Kana characters and Arabic numerals, such as [Kana text not recoverable from source], the Arabic numeral "200" may comprise non-Japanese characters.
  • If either query Q or query Q' contains non-Japanese characters, the non-Japanese characters are removed, step 1025.
  • After removing non-Japanese characters from query Q, namely the Arabic numeral "200," query Q
  • the character edit distance between query Q and query Q' may be used by the similarity score function illustrated in Table A to calculate a similarity score for Q'.
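By way of illustration only, the levk calculation of FIG. 10 might be sketched as below. Converting Kanji to Kana requires a reading dictionary, so it is represented by a caller-supplied `kanji_to_kana` function. That function, the toy reading table, and the Unicode ranges used to decide which characters are Japanese are all assumptions of this sketch, as is the use of the plain Levenshtein distance.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]

def is_japanese(ch):
    """Hiragana or Katakana (U+3040-U+30FF) or basic CJK ideographs (U+4E00-U+9FFF)."""
    return "\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff"

def levk(q, q_prime, kanji_to_kana):
    """Edit distance after Kanji-to-Kana conversion and non-Japanese removal."""
    def normalize(text):
        return "".join(ch for ch in kanji_to_kana(text) if is_japanese(ch))
    return edit_distance(normalize(q), normalize(q_prime))

# Toy reading table standing in for a real dictionary, for illustration only.
TOY_READINGS = {"楽天": "らくてん"}

def toy_kanji_to_kana(text):
    for kanji, kana in TOY_READINGS.items():
        text = text.replace(kanji, kana)
    return text
```

Kanji that the converter does not recognize are kept rather than dropped, since they are still Japanese characters.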
  • FIG. 11 presents one embodiment of the p12min function utilized by the similarity score function illustrated in Table A.
  • The p12min function calculates the query substitution probability of a given query Q' following a given query Q, and may also be used to calculate the phrase substitution probability of a phrase P' following a given phrase P.
  • one or more query logs may be maintained identifying the one or more queries and phrases submitted by a given user during a query session.
  • the query logs may identify the order of the one or more queries and phrases submitted by the user, for example, to provide an indication of how the user refined a query Q, how the user rewrote a query Q, how the user utilized one or more alternate writing systems of a language with multiple writing systems to express a query Q, etc.
  • the query logs may further indicate the frequency with which one or more users submitted one or more queries or phrases.
  • step 1105. A given query Q' is selected from a candidate set of queries, step 1110.
  • a check is performed to determine whether query Q' follows query Q in any of the one or more query logs, step 1115.
  • a check is performed to determine whether query Q' follows query Q in the query logs for a given user's query session, wherein a query session may comprise the one or more queries submitted by a user during a given time period.
  • a 'p12min' register is set to zero ("0"), step 1125, wherein the 'p12min' register may comprise a memory device for storing a given numeric value.
  • the frequency with which query Q' follows query Q in the query logs is identified, step 1120.
  • the 'p12min' register is set to the value of the quotient of the frequency with which query Q' follows query Q in the query logs and the frequency of query Q in the query logs, step 1140.
  • the 'p12min' register may be set to the value "7/12."
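By way of illustration only, the substitution probability of FIG. 11 might be sketched as the number of times query Q' follows query Q divided by the number of times query Q appears in the logs. Reading "follows" as "immediately follows within the same session", and representing the logs as a list of ordered sessions, are assumptions of this sketch.

```python
from fractions import Fraction

def p12min(q, q_prime, sessions):
    """Frequency of Q' directly following Q, divided by the frequency of Q.

    sessions: list of sessions, each a list of query strings in submission order.
    """
    q_count = 0
    follow_count = 0
    for session in sessions:
        for position, query in enumerate(session):
            if query != q:
                continue
            q_count += 1
            next_position = position + 1
            if next_position < len(session) and session[next_position] == q_prime:
                follow_count += 1
    if follow_count == 0:
        return Fraction(0)  # Q' never follows Q, so the register is set to zero
    return Fraction(follow_count, q_count)
```

Under these assumptions, logs in which query Q appears twelve times and is followed by query Q' seven times yield the value 7/12.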
  • the functions illustrated in FIGs. 3 through 11, and utilized by the similarity score function illustrated in Table A are not limited to the Japanese language and may be modified for one or more languages with multiple writing systems.
  • the similarity score function illustrated in Table A may utilize one or more combinations of the functions illustrated in FIGs. 3 through 11 in order to calculate a similarity score for a given query written according to one or more writing systems of a language with multiple writing systems.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Finance (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to systems and methods for identifying one or more queries related to a given query. The method of the present invention comprises receiving a query written according to one or more writing systems of a language with multiple writing systems. A candidate set of queries written according to one or more writing systems of the language with multiple writing systems is identified. A score is calculated for the one or more queries in the candidate set indicating the similarity of the one or more queries with respect to the query received.

Description

SYSTEM AND METHOD FOR IDENTIFYING RELATED QUERIES FOR LANGUAGES WITH MULTIPLE WRITING SYSTEMS
COPYRIGHT NOTICE
[0001] A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.
CROSS REFERENCE TO RELATED APPLICATIONS
[0002] This application is related to the following pending applications, each of which is hereby incorporated herein by reference in its entirety:
• U.S. Patent Application Serial No. 11/200,851, entitled "SYSTEM AND METHOD FOR DETERMINING ALTERNATE SEARCH QUERIES," filed August 10, 2005; and
• U.S. Provisional Application No. 60/736,133, entitled "MODULAR OPTIMIZED DYNAMIC SETS," filed November 9, 2005.
BACKGROUND OF THE INVENTION
[0003] The present invention generally provides methods and systems for identifying one or more queries related to a given search query written according to a language with multiple writing systems. More specifically, the present invention provides methods and systems for receiving a search query written according to combinations of one or more writing systems of a language with multiple writing systems and identifying one or more related queries from a candidate set of queries.
[0004] With the advent of the Internet and the multitude of web pages, media content, advertisements, etc., available to a user over the World Wide Web ("the web"), a need has arisen to provide users with streamlined approaches to obtain relevant information from the web. Search systems and processes have been developed to meet the needs of users to obtain such information. Examples of such technologies can be accessed through Yahoo!, Google and other search provider web sites.
[0005] Currently, users may employ client devices (such as personal computers (PCs), PDAs, smartphones, etc.) with access to wide area networks, e.g., the Internet, to search for and retrieve content. Typically, a user inputs a query via a client device and a search process returns one or more items of content, such as links, documents, web pages, advertisements, etc., related to the query. The items of content returned in response to a given query may be closely related or entirely unrelated to the subject or topic that the user was actually seeking. The success of a given search, which may be measured based upon how closely related the items of content retrieved are to a given query, may depend significantly upon the proper interpretation of a search query.
[0006] A query is made up of one or more words and phrases. Queries entered by human users, however, often fail to adequately describe the content that a given user may be seeking. Moreover, users may have only a general or vague idea of the content they may be seeking. For example, a user may wish to conduct a search, using the Yahoo! search engine, for a product advertised on television. The user may not know the name of the product, the manufacturer, etc., and may only be able to generally describe the product. Therefore, the query formulated by the user may be too broad, resulting in the retrieval of content items entirely unrelated to the content sought by the user. Similarly, the query terms selected by the user may fail to adequately describe the product, resulting in the retrieval of few if any content items.
[0007] Techniques are currently known for generating a candidate set of queries that may be related to a given query. For example, a user may enter the query "Apple® MP3 player" and be presented with one or more related queries, such as "IPOD®," "Itunes®," etc. Search providers, however, are presented with the challenge of identifying one or more queries from a candidate set of queries that are the most relevant or closely related in meaning to a given query. Moreover, certain languages such as Japanese have multiple writing systems, which further increases the complexity of identifying queries from a candidate set of queries that are the most relevant or similar in meaning to a given query. For example, a single Japanese query submitted to a search engine may be written according to varying combinations of one or more Japanese writing systems, such as Kanji, Katakana, Hiragana, JASCII, ASCII, etc. A query written according to the Japanese Kanji writing system may look entirely different from a query written according to the Japanese Katakana and Hiragana writing systems; however, the two queries may have very similar or identical meanings.
[0008] Additionally, search providers, such as Yahoo!, MSN, or Google may utilize a bidding marketplace whereby advertisers may bid upon terms in order to have one or more advertisements displayed in response to a query. For example, one or more advertisers may wish to display one or more advertisements for laptop computers and accordingly may bid upon the terms "notebook computer." The terms "notebook computer," however, may be written according to one or more writing systems of a language with multiple writing systems, such as Japanese. For example, the terms "notebook computer" may be written according to the Japanese Hiragana writing system, the Japanese Katakana writing system, etc.
[0009] A user may submit a query comprising the terms "notebook computer" to a given search provider, such as Yahoo!, written according to the Japanese Katakana writing system. The one or more advertisements with associated bids for the Katakana terms "notebook computer" may be retrieved and displayed to the user. In a bidding marketplace, the advertisement associated with the advertiser that provided the greatest bid for the Katakana terms "notebook computer" may be displayed in the most prominent position of a web page, e.g., ranked first in a ranked list of advertisements, displayed at the top of a given search results page, etc.
[0010] If the user selects one or more of the advertisements displayed, the search provider may monetize the selection of the user, such as by charging the advertiser associated with the advertisement selected an amount of money based upon the advertiser's bid. Retrieving and displaying only the advertisements that have associated bids for one or more terms, however, may result in a significant loss of revenue to a given search provider. For example, if a user enters a query comprised of terms that have not been bid upon by one or more advertisers, the search provider may fail to return any advertisements to the user, resulting in a loss of revenue to the search provider as the user will be unable to select any results. With reference to the abovementioned example, if the query entered by the user did not comprise the Katakana terms "notebook computer," but instead comprised the Hiragana terms "laptop computer," the search provider may not display properly targeted advertisements despite the similarity in meaning of the Katakana query "notebook computer" and the Hiragana query "laptop computer."
[0011] While techniques exist for identifying one or more queries from a candidate set of queries that are identical or similar in meaning to a given query, existing techniques are limited to languages written according to a single writing system. Current techniques thus fail to provide for the identification of queries that are most relevant or closely related in meaning to an original query that is written according to one or more writing systems of a language with multiple writing systems. In order to overcome shortcomings associated with existing techniques, the present invention provides systems and methods for identifying one or more queries from a candidate set of related queries that are the most similar in meaning with respect to a given search query, written according to one or more writing systems of a language with multiple writing systems.
SUMMARY OF THE INVENTION
[0012] The present invention is directed towards methods and systems for identifying one or more queries related to a given query. The method of the present invention comprises receiving a query written according to one or more writing systems of a language with multiple writing systems. According to one embodiment of the invention, the query received comprises a query written according to a combination of one or more Japanese writing systems, including the Japanese Hiragana, Katakana, Kana, Romaji, JASCII, and Kanji writing systems.
[0013] A candidate set of queries written according to one or more writing systems of the language with multiple writing systems associated with the query received is identified. According to one embodiment of the invention, the candidate set of queries comprises one or more queries related to the query received as indicated in one or more query logs.
[0014] The method further comprises calculating a score for the one or more queries in the candidate set indicating the similarity of the one or more queries with respect to the query received. The score calculated for the one or more queries in the candidate set indicates the similarity in meaning of a given query from the candidate set with respect to the received query. According to one embodiment of the invention, calculating a score comprises calculating a character edit distance between the received query and a query selected from the candidate set after converting the one or more characters in each query to Roman characters. According to another embodiment of the invention, calculating a score comprises calculating a character edit distance between the received query and a query selected from the candidate set after converting the one or more characters in each query to Roman characters and removing space characters from each query. According to a further embodiment of the invention, calculating a score comprises converting the characters of the query received and a query selected from the candidate set to Roman characters, and calculating the difference between one ("1") and the quotient of the number of unique space-separated co-occurring words in the received query and the selected query and the total number of unique space-separated words in both queries.
[0015] According to yet another embodiment of the invention, calculating a score comprises identifying whether a digit is unique to the received query or a query selected from the candidate set. According to a further embodiment, calculating a score comprises calculating a difference between the value one ("1") and the quotient of the number of co-occurring Japanese Kanji characters in the received query and a selected query from the candidate set, and the total number of unique Japanese Kanji characters in the received query and the selected query from the candidate set. According to another embodiment of the invention, calculating a score comprises converting the one or more characters of the received query and a query selected from the candidate set to Roman characters and calculating a number of Roman characters the queries have in common. According to yet another embodiment of the invention, calculating a score comprises identifying whether either the received query or a selected query from the candidate set contains a non-Roman character. According to yet another embodiment of the invention, calculating a score comprises calculating a character edit distance between the received query and a selected query from the candidate set after converting the Japanese Kanji characters of each query to Japanese Kana characters and removing all non-Japanese characters from each query. According to a further embodiment, calculating a score comprises calculating a quotient of the frequency with which a selected query from the candidate set follows the received query in one or more query logs and the frequency of the received query in the one or more query logs.
[0016] The method further comprises selecting one or more of the queries from the candidate set for distribution. According to one embodiment of the invention, the one or more queries selected from the candidate set for distribution comprise queries with scores exceeding a given threshold. The one or more queries selected for distribution may be distributed. According to one embodiment of the invention, the queries selected for distribution are embedded in one or more web pages.
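As an illustration of two of the scores summarized above, the following is a minimal sketch, not the claimed implementation. It assumes both queries have already been converted to Roman characters, and it assumes that "character edit distance" means the standard Levenshtein distance, which the text does not specify. The function names `char_edit_distance` and `word_distance` are hypothetical.

```python
def char_edit_distance(a: str, b: str, strip_spaces: bool = False) -> int:
    """Levenshtein distance between two queries assumed to be romanized already.

    strip_spaces=True corresponds to the embodiment that removes space
    characters from each query before comparing.
    """
    if strip_spaces:
        a, b = a.replace(" ", ""), b.replace(" ", "")
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def word_distance(a: str, b: str) -> float:
    """One minus (shared unique words / all unique words) for two queries."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    if not union:
        # Assumption: two empty queries are treated as identical.
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)
```

Under these assumptions, a pair of romanized queries that differ only in spacing has a character edit distance of zero when `strip_spaces=True`, while `word_distance` grows toward one as the queries share fewer space-separated words.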
[0017] The invention is also directed towards a system for identifying one or more queries related to a given query. The system of the present invention comprises a search engine operative to receive a query written according to one or more writing systems of a language with multiple writing systems. According to one embodiment of the invention, the search engine is operative to receive a query written according to one or more Japanese writing systems. The search engine is further operative to identify a candidate set of one or more queries written according to one or more writing systems of the language with multiple writing systems associated with the query received. According to one embodiment of the invention, the search engine is operative to identify a candidate set comprised of one or more queries related to the received query as indicated in one or more query logs.
[0018] A conversion component is operative to convert the received query and the one or more queries in the candidate set into one or more written formats. According to one embodiment of the invention, the conversion component is operative to convert a query into one or more written formats in accordance with one or more writing systems.
[0019] A similarity component is operative to calculate a score for the one or more queries in the candidate set indicating the similarity of the one or more queries with respect to the received query. The similarity component is operative to calculate a score indicating the similarity in meaning of a selected query from the candidate set with respect to the received query. According to one embodiment of the invention, the similarity component is operative to calculate a character edit distance between the received query and a selected query from the candidate set. According to a further embodiment of the invention, the similarity component is operative to calculate a difference between one ("1") and the quotient of the number of unique space-separated co-occurring words in the received query and a query selected from the candidate set, and the total number of unique space-separated words in both queries. According to yet another embodiment of the invention, the similarity component is operative to identify whether a digit is unique to the received query or a selected query from the candidate set.
[0020] According to another embodiment, the similarity component is operative to calculate a difference between one ("1") and the quotient of the number of co-occurring Japanese Kanji characters in the received query and a query selected from the candidate set, and the total number of unique Japanese Kanji characters in both queries. According to a further embodiment of the invention, the similarity component is operative to calculate a number of characters the received query and a selected query from the candidate set have in common. According to yet another embodiment of the invention, the similarity component is operative to identify whether the received query or a selected query from the candidate set contains one or more characters of a given writing system. According to a further embodiment, the similarity component is operative to calculate a quotient of the frequency with which a selected query from the candidate set follows the received query in one or more query logs and the frequency of the received query in the query logs.
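The Kanji overlap, digit, and writing-system checks attributed to the similarity component can be sketched as follows. This is an illustrative sketch under stated assumptions: Kanji are approximated by the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), "Roman" characters are approximated by ASCII, and full-width digits are folded to ASCII digits with Unicode NFKC normalization before comparison. The function names are hypothetical.

```python
import unicodedata


def kanji_chars(query: str) -> set:
    """Unique Kanji in a query (assumption: basic CJK Unified Ideographs only)."""
    return {ch for ch in query if "\u4e00" <= ch <= "\u9fff"}


def kanji_distance(a: str, b: str) -> float:
    """One minus (co-occurring Kanji / all unique Kanji in both queries)."""
    ka, kb = kanji_chars(a), kanji_chars(b)
    union = ka | kb
    if not union:
        # Assumption: queries without any Kanji are treated as distance 0.
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def digit_differs(a: str, b: str) -> bool:
    """True if some digit appears in exactly one of the two queries."""
    def digits(query: str) -> set:
        # NFKC folds full-width digits to their ASCII equivalents.
        folded = unicodedata.normalize("NFKC", query)
        return {ch for ch in folded if ch in "0123456789"}
    return digits(a) != digits(b)


def has_non_roman(query: str) -> bool:
    """True if the query contains any character outside ASCII (assumed 'Roman')."""
    return any(ord(ch) > 127 for ch in query)
```

For example, two queries that share two of eight distinct Kanji would receive a Kanji distance of 0.75 under this sketch.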
BRIEF DESCRIPTION OF THE DRAWINGS
[0021] The invention is illustrated in the figures of the accompanying drawings which are meant to be exemplary and not limiting, in which like references are intended to refer to like or corresponding parts, and in which:
[0022] FIG. 1 is a block diagram presenting a system for identifying one or more related queries written in accordance with combinations of one or more writing systems of a language with multiple writing systems, according to one embodiment of the present invention;
[0023] FIG. 2 is a flow diagram illustrating one embodiment of a method for selecting one or more related queries written in accordance with a combination of one or more writing systems of a language with multiple writing systems, according to one embodiment of the present invention;
[0024] FIG. 3 is a flow diagram illustrating one embodiment of a method for calculating the character edit distance between two queries written in accordance with one or more writing systems of a language with multiple writing systems, according to one embodiment of the present invention;
[0025] FIG. 4 is a flow diagram illustrating another embodiment for calculating the character edit distance between two queries written in accordance with one or more writing systems of a language with multiple writing systems, according to one embodiment of the present invention;
[0026] FIG. 5 is a flow diagram illustrating one embodiment of a method for calculating the word edit distance between two queries written in accordance with one or more writing systems of a language with multiple writing systems, according to one embodiment of the present invention;
[0027] FIG. 6 is a flow diagram illustrating one embodiment of a method for identifying differences in the digits appearing in two queries written in accordance with one or more writing systems of a language with multiple writing systems, according to one embodiment of the present invention;
[0028] FIG. 7 is a flow diagram illustrating one embodiment of a method for calculating the character edit distance between two queries written in accordance with one or more writing systems of a language with multiple writing systems considering the characters of only one of the writing systems, according to one embodiment of the present invention;
[0029] FIG. 8 is a flow diagram illustrating one embodiment of a method for identifying the number of characters overlapping in the prefixes of two queries written in accordance with one or more writing systems of a language with multiple writing systems, according to one embodiment of the present invention;
[0030] FIG. 9 is a flow diagram illustrating one embodiment of a method for identifying whether two queries written in accordance with one or more writing systems of a language with multiple writing systems have non-Roman characters, according to one embodiment of the present invention;
[0031] FIG. 10 is a flow diagram illustrating one embodiment of a method for calculating the character edit distance between two queries written in accordance with one or more writing systems of a language with multiple writing systems after both queries have been converted to a given writing system, according to one embodiment of the present invention; and
[0032] FIG. 11 is a flow diagram illustrating one embodiment of a method for calculating the query and phrase substitution probability of two queries written in accordance with one or more writing systems of a language with multiple writing systems, according to one embodiment of the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
[0033] In the following description, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention.
[0034] FIG. 1 presents a block diagram depicting one embodiment of a system for identifying one or more queries related to a given query written according to one or more writing systems of a language with multiple writing systems. According to the embodiment of FIG. 1, client devices 124a, 124b and 124c are communicatively coupled to a network 122, which may include a connection to one or more local and/or wide area networks, such as the Internet. According to one embodiment of the invention, a client device 124a, 124b and 124c is a general purpose personal computer comprising a processor, transient and persistent storage devices, input/output subsystem and bus to provide a communications path between components comprising the general purpose personal computer. For example, a client device may be a 3.5 GHz Pentium 4 personal computer with 512 MB of RAM, 40 GB of hard drive storage space and an Ethernet interface to a network. Other client devices are considered to fall within the scope of the present invention including, but not limited to, hand held devices, set top terminals, mobile handsets, PDAs, etc.
[0035] Users of client devices 124a, 124b and 124c communicatively coupled to the network 122 may submit search queries, comprising one or more terms, to the search provider 100. A search query submitted by a user via the network 122 to the search provider 100 may comprise one or more characters, terms, or phrases written according to one or more writing systems of a language with multiple writing systems. For example, users of client devices 124a, 124b, and 124c may formulate queries comprising Japanese Kanji characters, Japanese Katakana characters and JASCII characters. Similarly, users of client devices 124a, 124b and 124c may formulate queries comprising Japanese Romaji characters, Japanese Hiragana characters, and digits. For example, a user may submit a query written according to a combination of the Japanese Katakana, Hiragana, Kanji and ASCII writing systems: [Japanese-script query illegible in source].
[0036] The one or more search queries submitted by users of client devices 124a, 124b, and 124c, which may comprise characters and terms written according to one or more writing systems of a language with multiple writing systems, may be used by the search engine 107 at the search provider 100 to identify a candidate set of related queries. The one or more queries comprising the candidate set of related queries may be maintained in one or more local or remote data stores 102 and 108, respectively, which are operative to maintain one or more queries that may be related to a given query. According to one embodiment of the invention, data stores 102 and 108 are operative to maintain indices with entries identifying a set of queries related to one or more queries or terms. The indices maintained by data stores 102 and 108 may be supplemented with human editorial information indicating terms or queries that are related. For example, an index entry in data stores 102 and 108 may comprise a query ([Japanese-script query illegible in source]) written in accordance with the Japanese Katakana, Hiragana, Kanji, and ASCII writing systems, and one or more related queries or terms written in accordance with one or more Japanese writing systems.
[0037] Data stores 102 and 108 may be implemented as databases or any other type of storage structures capable of providing for the retrieval and storage of one or more sets of queries, such as a CD-ROM, tape, digital storage library, etc. The queries maintained in data stores 102 and 108 may comprise queries written according to one or more writing systems of a given language with multiple writing systems. For example, the queries maintained in data stores 102 and 108 may comprise queries written in accordance with the Japanese Kanji, Hiragana, Katakana, JASCII, and Romaji writing systems.
[0038] According to another embodiment of the present invention, a candidate set of related queries identified by the search engine 107 comprises one or more sequential pairs of queries that co-occur with statistical significance in one or more query logs. The search engine 107 may utilize query logs to identify a candidate set comprising one or more queries related to a query received from a client device 124a, 124b, and 124c. The plurality of queries submitted by users to the search provider 100, which may be written according to one or more writing systems of a language with multiple writing systems, may be maintained in a query log component 106. The query log component 106 may be implemented as a database or similar storage structure capable of providing for the storage of one or more queries written according to one or more writing systems.
[0039] The query log component 106 may maintain information identifying the frequency with which queries are submitted to the search provider 100. Similarly, the query log component 106 may maintain information identifying the frequency with which a given query follows a related query. For example, during a given session a user conducting a search may submit a query comprising the terms "intellectual property," written according to one or more writing systems of a language with multiple writing systems, e.g., Japanese. During the same session, the user may submit a query comprising the terms "patent attorney," written according to one or more Japanese writing systems. The query log component 106 may maintain information identifying the frequency with which the query "patent attorney" follows the query "intellectual property" during a given user's session.
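The two frequencies described above are sufficient to compute the quotient mentioned in the summary: the frequency with which a query Q' follows a query Q, divided by the frequency of Q. The following minimal sketch assumes that the log is available as a list of per-session query lists and that "follows" means "is the very next query in the same session"; both are illustrative assumptions, and the names `build_counts` and `substitution_probability` are hypothetical.

```python
from collections import Counter


def build_counts(sessions):
    """Count single queries and adjacent (query, next query) pairs per session."""
    query_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        query_counts.update(session)
        # zip pairs each query with the one submitted immediately after it.
        pair_counts.update(zip(session, session[1:]))
    return query_counts, pair_counts


def substitution_probability(q, q_prime, query_counts, pair_counts):
    """Frequency of q_prime following q, divided by the frequency of q."""
    if query_counts[q] == 0:
        # Assumption: a query never seen in the log yields a score of 0.
        return 0.0
    return pair_counts[(q, q_prime)] / query_counts[q]


# Usage with the example from the text, on a toy log of three sessions.
sessions = [
    ["intellectual property", "patent attorney"],
    ["intellectual property", "trademark"],
    ["patent attorney"],
]
query_counts, pair_counts = build_counts(sessions)
score = substitution_probability(
    "intellectual property", "patent attorney", query_counts, pair_counts
)
```

In this toy log, "intellectual property" appears twice and is followed by "patent attorney" once, so the quotient is one half.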
[0040] The search engine 107 may utilize the query logs maintained by the query log component 106 to identify a candidate set comprising one or more queries that are statistically significantly related to a query received from a given client device 124a, 124b, and 124c. The one or more queries identified as related to a given query, as indicated by the query logs maintained in the query log component 106, may be used to supplement or generate a candidate set of related queries. The candidate set of related queries may comprise queries written according to one or more writing systems of a given language with multiple writing systems, such as Japanese. Exemplary methods for identifying one or more queries related to a given query using query logs are described in commonly owned U.S. Patent Application Serial No. 11/200,851, entitled "SYSTEM AND METHOD FOR DETERMINING ALTERNATE SEARCH QUERIES," and U.S. Provisional Application No. 60/736,133, entitled "MODULAR OPTIMIZED DYNAMIC SETS," the disclosures of which are hereby incorporated by reference in their entirety.
[0041] A similarity component 104 uses the candidate set identified by the search engine 107 to calculate a similarity score for the one or more queries in the candidate set of related queries. The similarity component 104 is operative to select a given query Q' from the candidate set of related queries and calculate a similarity score for Q', indicating the strength of the similarity in meaning of Q' with respect to a given query Q received from a given client device 124a, 124b and 124c. The similarity component 104 is operative to calculate a similarity score for each of the one or more queries in the candidate set of related queries identified by the search engine 107 according to methods described herein.
[0042] The similarity component 104 may utilize a conversion component 110 to calculate a similarity score for each query Q' in the candidate set of related queries identified by the search engine 107. According to one embodiment of the invention, the conversion component 110 converts a given query into one or more written formats. The one or more written formats of a given query Q' generated by the conversion component 110 may be delivered to the similarity component 104 to facilitate the calculation of a similarity score. For example, the similarity component 104 may perform numerous comparisons with respect to a given query Q, received from a user, and a related query Q', selected from a candidate set of related queries, in order to calculate an accurate similarity score. As previously described, however, the one or more queries in the candidate set of related queries may be written according to one or more writing systems of a given language with multiple writing systems. Similarly, a query received from a given client device 124a, 124b, and 124c may be written according to one or more writing systems of a given language with multiple writing systems. The one or more comparisons performed by the similarity component 104 may require a query received from a user, Q, and a given query selected from the candidate set of related queries, Q', to be expressed in accordance with a particular writing system. For example, the similarity component 104 may require that the one or more JASCII characters of a given query Q and a related query Q' be converted to ASCII characters in order to compare the two queries.
[0043] In order to compare a query Q and a query Q', which may be written according to different writing systems, the similarity component 104 may deliver a given query to the conversion component 110. According to one embodiment of the invention, the conversion component 110 is operative to identify the language and writing system associated with a given query and convert the query into one or more alternate written formats. The candidate set identified by the search engine 107 may comprise queries written according to a variety of writing systems of a given language with multiple writing systems, such as the Japanese Kanji, Kana, JASCII, and Romaji writing systems. The conversion component 110 is operative to identify a query as written according to one or more Japanese writing systems and convert the query into one or more alternate writing systems. For example, the conversion component 110 is operative to identify a query as written according to the Japanese Katakana writing system and convert the query in accordance with the Japanese Romaji writing system. Similarly, the conversion component 110 is operative to identify a query as comprising one or more JASCII characters and convert the one or more JASCII characters to ASCII characters in order to facilitate the calculation of a similarity score by the similarity component 104.
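Parts of the conversion described above can be sketched with Unicode code point arithmetic. The sketch below is illustrative only and rests on several assumptions: "JASCII" is taken to mean the Unicode full-width forms (U+FF01 to U+FF5E) plus the ideographic space (U+3000), Kanji are approximated by the basic CJK Unified Ideographs block, and Katakana is folded to Hiragana rather than to Romaji, because romanization and Kanji-to-Kana conversion need a transliteration table or dictionary that is outside this sketch. All function names are hypothetical.

```python
def jascii_to_ascii(query: str) -> str:
    """Map full-width forms (assumed to be 'JASCII') to their ASCII equivalents."""
    out = []
    for ch in query:
        cp = ord(ch)
        if 0xFF01 <= cp <= 0xFF5E:
            out.append(chr(cp - 0xFEE0))  # full-width '!'..'~' -> ASCII '!'..'~'
        elif cp == 0x3000:
            out.append(" ")               # ideographic space -> ASCII space
        else:
            out.append(ch)
    return "".join(out)


def katakana_to_hiragana(query: str) -> str:
    """Fold Katakana letters (U+30A1..U+30F6) onto Hiragana (U+3041..U+3096)."""
    return "".join(
        chr(ord(ch) - 0x60) if 0x30A1 <= ord(ch) <= 0x30F6 else ch
        for ch in query
    )


def writing_systems(query: str) -> set:
    """Label the writing systems present in a query by Unicode block."""
    found = set()
    for ch in query:
        cp = ord(ch)
        if ch.isspace():
            continue
        if cp < 0x80:
            found.add("ascii")
        elif 0x3040 <= cp <= 0x309F:
            found.add("hiragana")
        elif 0x30A0 <= cp <= 0x30FF:
            found.add("katakana")
        elif 0x4E00 <= cp <= 0x9FFF:
            found.add("kanji")
        elif 0xFF01 <= cp <= 0xFF5E:
            found.add("jascii")
    return found
```

A similarity component along these lines could call `writing_systems` to decide which conversions to request, then compare the converted forms of Q and Q'.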
[0044] According to one embodiment of the invention, the similarity scores calculated by the similarity component 104 for the one or more queries in the candidate set of related queries are used by a distribution component 116 to select one or more queries from the candidate set for distribution. The selection of queries based upon similarity scores allows for the selection of queries that are the most similar in meaning with respect to a given query Q. For example, the distribution component 116 may select one or more queries from the candidate set of related queries with similarity scores exceeding a given threshold. Similarly, the distribution component may select the N queries from the candidate set with the greatest similarity scores. Those of skill in the art recognize other techniques for selecting one or more queries from a candidate set using a similarity score.
[0045] The distribution component 116 may distribute the one or more queries selected from the candidate set. According to one embodiment of the invention, the distribution component 116 displays the queries selected from the candidate set to a user via the network 122 as "suggested alternate queries" or "queries similar in meaning." Alternatively, or in conjunction with the foregoing, the distribution component 116 is operative to deliver the one or more queries selected to the search engine 107, which may embed the selected queries in a search result web page that may be viewed by a given user of a client device 124a, 124b, and 124c communicatively coupled to the network 122.
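The two selection strategies named above, a score threshold and the N greatest scores, can be sketched as follows. The sketch assumes the similarity scores are held in a dictionary keyed by candidate query and that a greater score means greater similarity; the function names and the sample scores are hypothetical.

```python
def select_by_threshold(scores: dict, threshold: float) -> list:
    """Candidate queries whose similarity score exceeds the threshold."""
    return [query for query, score in scores.items() if score > threshold]


def select_top_n(scores: dict, n: int) -> list:
    """The n candidate queries with the greatest similarity scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [query for query, _ in ranked[:n]]


# Usage with made-up scores for three candidate queries.
scores = {"patent attorney": 0.82, "trademark": 0.55, "copyright law": 0.31}
above = select_by_threshold(scores, 0.5)
best_two = select_top_n(scores, 2)
```

A threshold yields a variable number of suggestions depending on how many candidates score well, whereas top-N always yields at most N, which may suit a fixed-size "suggested alternate queries" area on a results page.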
[0046] The similarity scores calculated by the similarity component 104 for the one or more queries in the candidate set may further be used to select one or more items of content, including advertisements, for distribution in response to a given request. According to one embodiment of the invention, advertisements may be maintained in the abovementioned data stores 102 and 108, or in one or more disparate data stores (not illustrated). The one or more local 102, remote 108, or disparate data stores are operative to maintain one or more advertisements and associated bids for terms corresponding to the advertisements. For example, a given advertiser may wish to display a given advertisement for notebook computers. The advertiser may thus bid on the terms "notebook computer" and identify the advertisement that is to be displayed in response to a query comprising the terms "notebook computer." When the search provider 100 receives a query, the search engine 107 may search local and remote data stores 102 and 108, or one or more disparate data stores, to determine whether one or more advertisers have provided bids for the one or more terms comprising the query received. If one or more bids for the terms comprising the query are identified, the advertisements associated with the bids for the one or more terms may be retrieved and displayed to the user on the user's client device 124a, 124b and 124c using the distribution component 116. If the user selects a given advertisement displayed, the advertiser associated with the advertisement selected may be charged a monetary sum in accordance with the advertiser's bid. [0047] An advertiser, however, may choose to bid upon terms written according to only a single writing system of a language with multiple writing systems. For example, an advertiser may choose to bid upon terms written according to only the Japanese Hiragana writing system. 
As previously described, however, the one or more search queries submitted by users of client devices 124a, 124b, and 124c may comprise terms and phrases written according to one or more writing systems. The search engine 107 may thus utilize the queries with similarity scores exceeding a given threshold to expand the breadth of advertisements retrieved in response to a given query. According to one embodiment of the invention, the search engine 107 identifies the one or more advertisements responsive to the terms comprising the one or more queries with similarity scores exceeding a given threshold. The one or more advertisements identified as responsive to the terms comprising the queries with similarity scores exceeding a given threshold may be selected for distribution to one or more client devices 124a, 124b, and 124c. [0048] For example, a user of a client device 124a, 124b and 124c may formulate a search query Q comprised of Japanese terms written according to both the Japanese Kanji and Romaji writing systems. The user may submit the query to the search provider 100 via the network 122. The search engine 107 may determine that no advertisers have provided bids for the Kanji and Romaji terms utilized by the user. Alternatively, or in conjunction with the foregoing, the search engine 107 may determine that displaying the advertisements corresponding to the bids associated with the Kanji and Romaji terms utilized by the user will result in little if any revenue. The search engine 107, however, may utilize the terms comprising the one or more queries selected from the candidate set with similarity scores exceeding a given threshold to identify one or more terms with associated bids. Similarly, the search engine 107 may utilize the terms comprising the one or more queries selected from the candidate set with similarity scores exceeding a given threshold to identify one or more terms with bids exceeding a given threshold. 
The search engine 107 may thereafter utilize the one or more terms with associated bids, or the one or more terms with associated bids exceeding a given threshold, to select one or more advertisements responsive to the search query Q formulated by the user.
[0049] According to another example, assume a given query Q' selected from the candidate set with a similarity score exceeding a given threshold comprises Hiragana terms, whereas the abovementioned query Q formulated by the user comprises Kanji and Romaji terms. The search engine may utilize the one or more Hiragana terms comprising query Q' to determine whether one or more advertisers have bid upon the Hiragana terms comprising query Q'. Similarly, the search engine may determine whether one or more advertisers have provided bids for the one or more Hiragana terms comprising query Q' exceeding a given threshold. The search engine 107 may retrieve the one or more advertisements with associated bids for the terms comprising query Q' and deliver the one or more advertisements to the distribution component. According to one embodiment of the invention, the search engine 107 retrieves the one or more advertisements with the greatest associated bids for the one or more terms comprising query Q'. The distribution component 116 may thereafter distribute the one or more advertisements to the user that submitted the query Q.
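The expansion of advertisement retrieval described in paragraphs [0047] through [0049] can be sketched as follows. This is a simplified illustration under stated assumptions: the bid table layout, the function name, and the treatment of a whole query string as the bidded term are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical bid table: bidded term -> list of (advertisement id, bid amount).
Bids = Dict[str, List[Tuple[str, float]]]


def ads_for_query(query: str, related: Dict[str, float], bids: Bids,
                  score_threshold: float, bid_threshold: float = 0.0) -> List[str]:
    """Collect ads bid on the query itself or on sufficiently similar related queries."""
    # The query Q plus every related query Q' whose similarity score exceeds the threshold.
    terms = [query] + [q for q, score in related.items() if score > score_threshold]
    best: Dict[str, float] = {}
    for term in terms:
        for ad_id, bid in bids.get(term, []):
            if bid > bid_threshold:
                # Keep the greatest bid seen for each advertisement.
                best[ad_id] = max(best.get(ad_id, 0.0), bid)
    # Greatest bids first; ties are broken by advertisement id for a stable order.
    return sorted(best, key=lambda ad_id: (-best[ad_id], ad_id))
```

In this sketch a query written in Kanji and Romaji with no bids of its own can still retrieve an advertisement bid on a Hiragana query, provided that Hiragana query scores above the similarity threshold.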
[0050] While the above embodiment describes the receipt and processing of queries, the search provider 100 system illustrated in FIG. 1 is not limited to receiving and calculating similarity scores for queries, and may further be used to calculate similarity scores for one or more terms comprising one or more strings of text. Users of client devices 124a, 124b, and 124c may deliver to the search provider 100 one or more strings of text comprising one or more terms, including but not limited to, phrases, sentences, paragraphs, and documents written according to one or more writing systems of a language with multiple writing systems. Accordingly, the search provider 100 records a log of these one or more strings of text in one or more log files. The search provider 100 is operative to identify a candidate set comprising one or more items from its log files, wherein a given item comprises one or more sets of terms related to the one or more terms delivered by a given user of a client device 124a, 124b, and 124c. For example, a given item in the candidate set may comprise a phrase or a sentence. Similarly, a given item in the candidate set may comprise a paragraph or an entire document. The search provider may calculate a similarity score for the one or more items in the candidate set indicating the strength of the similarity in meaning of an item with respect to the one or more terms received from a client device 124a, 124b, and 124c.
[0051] FIG. 2 illustrates one embodiment of a method for selecting one or more queries Q' from a candidate set that are related in meaning to a given query Q, wherein query Q and Q' are written in accordance with one or more writing systems of a language with multiple writing systems. As illustrated in FIG. 2, a search query is received from a given user, step 205. The query may be received from a client device communicatively coupled to a network, such as the Internet, and comprise one or more terms or phrases written according to combinations of one or more writing systems of a language with multiple writing systems. For example, the query received from a user may comprise Japanese terms written in accordance with the Kanji, Katakana, and Hiragana writing systems.
[0052] A candidate set comprised of queries related to a given query Q formulated by a user is identified, step 210. The candidate set may be comprised of queries written according to one or more writing systems of the language associated with the user's query. For example, a given query Q may comprise terms written in accordance with the Japanese Katakana writing system, such as the query ラクテン. The candidate set of related queries may thus comprise one or more queries written in accordance with one or more combinations of one or more Japanese writing systems. For example, the candidate set of queries related to the abovementioned Katakana query ラクテン may comprise the Romaji query rakuten, the Kanji query 楽天, the Hiragana query らくてん, etc.
[0053] The candidate set of queries related to a given query Q may be generated using one or more query logs. According to one embodiment of the invention, query logs may identify one or more queries formulated by a user during a given query session. For example, during a given query session, a user may formulate a query comprising terms written according to the Japanese Hiragana and Kanji writing systems. During the same query session, a user may also formulate a query comprising terms written according to the Japanese Katakana and Romaji writing systems. An analysis may be performed to determine whether the two queries co-occur in the one or more query logs with statistical significance. According to one embodiment of the invention, a statistical significance threshold value may be used to select the one or more queries that are the most related to a given query Q as indicated by one or more query logs.
[0054] The candidate set may be generated with the one or more queries identified as related to the given query with statistical significance, or with statistical significance exceeding a given threshold value as indicated by one or more query logs. The one or more queries comprising the candidate set of related queries may be selected according to methods for determining statistically significantly related queries using query logs described in the above-identified applications incorporated by reference in their entirety.
[0055] A given query Q' is selected from the candidate set of related queries, step 215. According to the embodiment illustrated in FIG. 2, a similarity score is calculated for the query Q' selected, step 220. The similarity score calculated for a given query Q' provides a numerical value indicating the strength of the similarity of the meaning of query Q' with respect to the meaning of a given query Q, written according to one or more writing systems of a language with multiple writing systems. Table A illustrates one embodiment of an equation that may be used to calculate a similarity score for a given query Q'.
[0056] The equation presented in Table A may be used to calculate a score indicating the strength of the similarity in meaning of a given query Q' with respect to a given query Q, which may be written according to one or more Japanese writing systems, including, but not limited to, Kanji, Kana, JASCII, Katakana, Romaji, and Hiragana. Those of skill in the art recognize that the equation illustrated in Table A may be modified so as to provide for the calculation of a similarity score for other languages with multiple writing systems.

$$
\begin{aligned}
\text{Similarity score}(Q') ={} & 1.47551 \\
& + \mathit{levr}(Q, Q') \times (-1.68821) \\
& + \mathit{levrs}(Q, Q') \times 2.48700 \\
& + \mathit{wordr}(Q, Q') \times 0.44366 \\
& + \mathit{digit}(Q, Q') \times 0.75388 \\
& + \mathit{kanjid}(Q, Q') \times 0.22496 \\
& + \mathit{opr}(Q, Q') \times (-0.40083) \\
& + \mathit{Japanese}(Q, Q') \times 0.09368 \\
& + \mathit{levk}(Q, Q') \times (-0.32574) \\
& + \mathit{pl2min}(Q, Q') \times (-0.33258)
\end{aligned}
$$

Table A
[0057] According to the equation presented in Table A, Q represents a given query written according to one or more Japanese writing systems. Q' represents a query selected from a candidate set of queries related to query Q. Levr is a function for calculating the character edit distance between Q and Q' after converting all Japanese characters to Roman characters. Levrs is a function for calculating the character edit distance between Q and Q' after converting all Japanese characters to Roman characters and removing spaces. Wordr is the word edit distance between Q and Q' after converting all Japanese characters to Roman characters. Digit is a function for identifying whether Q contains any digits not appearing in Q' and vice versa. Kanjid is a function for determining whether either Q or Q' contains Kanji characters, and if so, identifying a Kanji disagreement between Q and Q'. Opr is a function for calculating the number of characters Q and Q' have in common starting from the leftmost character of each query and continuing until the first character disagreement, after all Japanese characters in each query have been converted to Roman characters. Japanese is a function for identifying whether Q or Q' contains non-Roman characters. Levk is a function for calculating the character edit distance between Q and Q' after all Kanji characters have been converted to Kana characters and all non-Japanese characters have been removed. Pl2min is a function for calculating the query substitution probability of query Q' following query Q in a log of user query sessions. Embodiments of the functions utilized by the similarity score function illustrated in Table A are illustrated in FIG. 3 through FIG. 11.
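The linear combination of Table A can be sketched as follows. The intercept and coefficients are those given in Table A; the dictionary layout, the lowercase feature keys, and the function name are assumptions made for illustration, and the feature values are supplied by the caller.

```python
from typing import Dict

# Intercept and coefficients as given in Table A.
INTERCEPT = 1.47551
WEIGHTS: Dict[str, float] = {
    "levr": -1.68821,
    "levrs": 2.48700,
    "wordr": 0.44366,
    "digit": 0.75388,
    "kanjid": 0.22496,
    "opr": -0.40083,
    "japanese": 0.09368,
    "levk": -0.32574,
    "pl2min": -0.33258,
}


def similarity_score(features: Dict[str, float]) -> float:
    """Combine the nine feature values for a pair (Q, Q') into one score."""
    missing = WEIGHTS.keys() - features.keys()
    if missing:
        raise KeyError(f"missing feature values: {sorted(missing)}")
    return INTERCEPT + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
```

With every feature equal to zero the score reduces to the intercept, 1.47551.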
[0058] A check is performed to determine whether a similarity score has been calculated for the one or more queries in the candidate set, step 225. If one or more queries in the candidate set do not have an associated similarity score, an additional query Q' is selected from the candidate set, step 215. Alternatively, if a similarity score has been calculated for the one or more queries in the candidate set, a given query Q' is selected from the candidate set, step 230. A check is performed to determine whether the similarity score associated with the query Q' selected from the candidate set exceeds a given similarity score threshold, step 235. According to one embodiment of the invention, the similarity score threshold comprises a numeric value that may be used to perform a comparison with the similarity score associated with a given query Q'. Because a similarity score indicates the strength of the similarity in meaning of a given query Q' with respect to a query Q, the use of a similarity score threshold facilitates the selection of one or more queries from the candidate set that are the most similar in meaning with respect to the query Q.
[0059] If the similarity score associated with a given query Q' exceeds the similarity score threshold, the query Q' is added to a distribution set, step 245. According to one embodiment of the invention, the distribution set comprises the one or more queries selected from the candidate set that have similarity scores exceeding the similarity score threshold. If the similarity score associated with a given query Q' does not exceed the similarity score threshold, the query Q' is not added to the distribution set, step 240.
[0060] A check is performed to determine whether there are additional queries in the candidate set that require analysis, step 250. If one or more queries in the candidate require analysis, an additional query Q' is selected from the candidate set, step 230. Alternatively, after all queries in the candidate set have been analyzed, and the distribution set has been populated with the one or more queries exceeding the similarity score threshold, the one or more queries in the distribution set are distributed, step 255.
[0061] The one or more queries in the distribution set of queries exceeding the similarity score threshold may be delivered to the user who submitted the query Q. According to one embodiment of the invention, the one or more queries in the distribution set are displayed to the user in a results web page. For example, a user may be presented with a web page comprising results, such as links to content items that are responsive to the query Q, as well as the one or more Q' queries comprising the distribution set that are most similar in meaning with respect to the query Q. The one or more queries in the distribution set delivered to a given user may be displayed in a ranked list according to similarity score to indicate to the user the relative strength of the similarity in meaning of a given query Q' with respect to the query Q.
[0062] FIGs. 3 through 11 illustrate embodiments of the functions presented in Table A that may be used to calculate a similarity score for a given query Q' selected from a candidate set of queries. As previously described, the plurality of functions illustrated in Table A, and further described in FIGs. 3 through 11, may be used to calculate a similarity score indicating the strength of the similarity in meaning of a given query Q' with respect to a query Q, written according to one or more Japanese writing systems. Those of skill in the art recognize, however, that the embodiments of the functions illustrated in FIGs. 3 through 11 are exemplary and not intended to be limited to the Japanese language and writing systems, and may be modified so as to provide for the calculation of a similarity score for other languages with multiple writing systems. Those of skill in the art further recognize that the functions illustrated in FIGs.
3 through 11 are not limited to calculating similarity scores for a candidate set comprising one or more queries related to a given query and may be used to calculate similarity scores for a candidate set of queries comprising one or more queries selected according to a plurality of techniques. Additionally, those of skill in the art recognize that the functions illustrated in FIGs. 3 through 11 are not limited to calculating similarity scores for a candidate set comprising one or more queries and may be modified so as to calculate similarity scores for one or more sets of terms, including but not limited to, phrases, sentences, paragraphs, and documents.
[0063] FIG. 3 illustrates one embodiment of a method for calculating the character edit distance between a given query Q, written according to one or more Japanese writing systems, and a query Q' selected from a candidate set of queries. The method presented in FIG. 3 illustrates one embodiment of the levr function utilized by the similarity score function illustrated in Table A.
[0064] The one or more characters comprising a query Q, which may be written according to one or more Japanese writing systems, such as Kanji, Katakana, Hiragana, etc., are converted to Roman characters, step 305. A given query Q' is selected from a candidate set comprised of one or more queries, step 310. The query Q' selected from the candidate set may be written according to one or more writing systems of the language associated with query Q. For example, Q' may be written according to the same writing system as the query Q or one or more alternate Japanese writing systems, such as the Japanese Romaji writing system, the Japanese Kana writing system, etc. A check is performed to determine whether the characters comprising Q' are in Roman character form, step 315. If query Q' is not in Roman character form, the one or more characters comprising Q' are converted into Roman characters, step 320.
If the one or more terms comprising Q' are already in Roman character form, or after all the characters in Q' have been converted to Roman character form, a calculation is performed to identify the character edit distance between query Q and query Q', step 325. The character edit distance value may be provided to the similarity score function illustrated in Table A in order to calculate a similarity score for Q'.
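The edit distance calculation of step 325 can be sketched as follows. The specification does not name a particular edit distance; this sketch assumes the standard Levenshtein distance (insertions, deletions, and substitutions each costing one). It also assumes both queries have already been converted to Roman characters, since the conversion itself requires a pronunciation dictionary that is outside the scope of the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def levr(q_roman: str, q_prime_roman: str) -> int:
    """Character edit distance between two queries already in Roman character form."""
    return edit_distance(q_roman, q_prime_roman)
```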
[0065] FIG. 4 illustrates one embodiment of a method for calculating the character edit distance between a given query Q, written according to one or more Japanese writing systems, and a query Q' selected from a candidate set of queries. The embodiment illustrated in FIG. 4 provides one embodiment of the levrs function used by the similarity score function illustrated in Table A.
[0066] According to the embodiment illustrated in FIG. 4, query Q, written according to one or more Japanese writing systems, such as Kanji, Katakana, or Hiragana, is converted to Roman character form, step 405. Thereafter, all space characters appearing in the Roman character form query Q are removed, step 408. For example, a given query Q may comprise the Kanji terms
電車男. After conversion to Roman character form, query Q may comprise the terms "densha otoko," and after removing spaces query Q may comprise the characters "denshaotoko."
[0067] A given query Q' is selected from a candidate set comprising one or more queries, step 410. A check is performed to determine whether Q' is in Roman character form, step 415. If query Q' is not in Roman character form, the one or more characters comprising query Q' are converted to Roman characters, step 420. If the characters comprising query Q' are already in Roman character form, or after the characters comprising query Q' have been converted to Roman character form, all spaces within query Q' are removed, step 425. The character edit distance between the Roman character form of query Q and Q' is thereafter calculated, step 430. The calculated character edit distance between query Q and Q' may be used by the similarity score function illustrated in Table A to calculate a similarity score for Q'.
[0068] FIG. 5 illustrates one embodiment of the wordr function illustrated in Table A. The embodiment of the wordr function illustrated in FIG. 5 provides for the calculation of the word edit distance between a given query Q, written according to one or more Japanese writing systems, and a query Q', selected from a candidate set of queries. According to one embodiment of the invention, the word edit distance between a given query Q and a query Q' is the difference between the value one ("1") and the quotient of the number of unique space-separated co-occurring words in Q and Q' and the total number of unique space-separated words in both Q and Q'.
[0069] The characters comprising a given query Q, written according to one or more Japanese writing systems, are converted to Roman character form, step 505. Thereafter, a given query Q' is selected from a candidate set of queries, step 506. A check is performed to determine whether the query Q' is in Roman character form, step 508. If query Q' is not in Roman character form, the characters comprising query Q' are converted to Roman characters, step 510. If the characters comprising the query Q' are already in Roman character form, or after the characters comprising Q' have been converted to Roman character form, the number of unique space-separated co-occurring words in Q and Q' are identified, step 515. The quotient of the number of unique space-separated co-occurring words in Q and Q' and the total number of unique space-separated words in both Q and Q' is calculated, step 520. According to one embodiment of the invention, the number of unique space-separated co-occurring words comprises the number of unique words that appear in both a given query Q and a query Q'. Additionally, the total number of unique space-separated words in both Q and Q' comprises the sum of the unique space-separated words in a given query Q and a query Q'. [0070] The difference between the value one ("1") and the calculated quotient is calculated, step 525, and assigned to a 'wordr' register, step 530. According to one embodiment of the invention, the 'wordr' register comprises a memory device for storing a given numeric value. The value assigned to the 'wordr' register may be used by the similarity score function illustrated in Table A to calculate a similarity score for the query Q'.
[0071] For example, a given query Q, in Roman character form, may be comprised of the terms "kuruma kemuri." Similarly, a given query Q' in Roman character form may be comprised of the terms "sora kemuri." The number of unique space-separated co-occurring words in Q and Q' is one ("1"), namely, the word "kemuri," wherein the total number of unique space-separated words in both Q and Q' is three ("3"), namely the words "kuruma," "sora," and "kemuri." Thus, the quotient of the number of unique space-separated co-occurring words in Q and Q' and the total number of unique space-separated words in both Q and Q' is 1/3. Additionally, the difference between one ("1") and the calculated quotient is 2/3. The value 2/3 may be assigned to the 'wordr' register and used by the similarity score function illustrated in Table A to calculate a similarity score for query Q'.
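The worked example above can be reproduced with a short sketch. It assumes both queries are already in Roman character form and, following the example (three total words for "kuruma kemuri" and "sora kemuri"), counts the total as the number of distinct words across both queries. The handling of two empty queries is an assumption.

```python
def wordr(q_roman: str, q_prime_roman: str) -> float:
    """Word edit distance: one minus (co-occurring unique words / total unique words)."""
    words_q = set(q_roman.split())
    words_q_prime = set(q_prime_roman.split())
    all_words = words_q | words_q_prime
    if not all_words:
        return 0.0  # assumption: two empty queries are treated as identical
    shared = words_q & words_q_prime
    return 1.0 - len(shared) / len(all_words)
```

For the queries in the example, `wordr("kuruma kemuri", "sora kemuri")` evaluates to 2/3.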
[0072] FIG. 6 illustrates one embodiment of a method for determining whether a digit is unique to a given query Q, written according to one or more Japanese writing systems, in comparison with a query Q' selected from a candidate set of queries. The embodiment presented in FIG. 6 provides one embodiment of the digit function used by the similarity score function illustrated in Table A.
[0073] A given query Q' is selected from a candidate set comprised of queries written according to one or more writing systems, step 605. A check is performed to determine whether a digit in a given query Q does not appear in query Q', step 610. For example, a given query Q may contain Japanese Kanji digits corresponding to the value expressed by Arabic numerals "68", and a given query Q' may contain Japanese Kanji digits corresponding to the value expressed by Arabic numerals "98". The check performed at step 610 would therefore determine that the Japanese Kanji digit 六 (six) is unique to query Q, as it does not appear in query Q'. Similarly, a given query Q may comprise Japanese Kanji characters and the Arabic numerals 2005, and a given query Q' may comprise Japanese Kanji characters and the Arabic numerals 2004. The check performed at step 610 would determine that the Arabic numeral 5 is unique to query Q, as it does not appear in query Q'.
[0074] If a digit is identified as appearing in query Q and not in query Q', a 'digit' register is set to the value one ("1"), indicating that query Q contains a digit not in query Q', step 620. According to one embodiment of the invention, the 'digit' register comprises a memory device for storing a given numeric value.
[0075] Alternatively, if Q' contains each of the one or more digits appearing in query Q, an additional check is performed to determine whether a digit in query Q' does not appear in query Q, step 615. If query Q' contains a digit which does not appear in query Q, the abovementioned 'digit' register is set to the value one ("1"), indicating that query Q' contains a digit unique to Q', step 620. Alternatively, if query Q contains each of the one or more digits in Q', the 'digit' register is set to zero ("0"), step 625, indicating that the one or more digits in query Q' appear in query Q, and vice versa. The value assigned to the 'digit' register, either zero ("0") or one ("1"), may be used by the similarity score function illustrated in Table A to calculate a similarity score for the query Q'. [0076] FIG. 7 presents one embodiment of the kanjid function used by the similarity score function illustrated in Table A. A given query Q, which may be written according to one or more Japanese writing systems, is received, step 705. A check is performed to determine whether query Q contains one or more Japanese Kanji characters, step 710. If query Q does not contain any Kanji characters, a 'kanjid' register is set to zero ("0"), step 708, wherein the 'kanjid' register may comprise a memory device for storing a given numeric value. Alternatively, if query Q contains one or more Kanji characters, a query Q' is selected from a candidate set of queries, step 715.
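The digit check of FIG. 6, described in paragraphs [0072] through [0075], can be sketched as follows. The sketch relies on Python's `str.isnumeric`, which is true for Arabic numerals and for Kanji numerals such as 六; treating exactly those characters as "digits" is an assumption, and the example queries are illustrative.

```python
def digit(q: str, q_prime: str) -> int:
    """Return 1 if either query contains a digit absent from the other, else 0."""
    digits_q = {ch for ch in q if ch.isnumeric()}
    digits_q_prime = {ch for ch in q_prime if ch.isnumeric()}
    if digits_q - digits_q_prime:
        return 1  # step 610: a digit in Q does not appear in Q'
    if digits_q_prime - digits_q:
        return 1  # step 615: a digit in Q' does not appear in Q
    return 0      # step 625: every digit appears in both queries
```

Note that this sketch compares characters as written, so a Kanji numeral and the Arabic numeral of the same value count as different digits.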
[0077] A check is performed to determine whether the query Q' selected from the candidate set contains one or more Kanji characters, step 720. If query Q' does not contain any Kanji characters, the aforementioned 'kanjid' register is set to zero ("0"), step 708. In contrast, if Q' contains one or more Kanji characters, the one or more non-Kanji characters in Q and Q' are removed, step 722. The number of unique Kanji characters co-occurring in query Q and query Q' is thereafter identified, step 725. For example, if after removing non-Kanji characters query Q is comprised of a given set of Kanji characters, and if after removing non-Kanji characters query Q' is comprised of Kanji characters of which two also appear in query Q, the number of unique co-occurring Kanji characters in Q and Q' is two ("2").
[0078] The total number of unique Kanji characters in both Q and Q' is thereafter identified, step 727. For example, with reference to the abovementioned queries Q and Q', the total number of unique Kanji characters in both Q and Q' is six ("6"), counting the unique Kanji characters from query Q and the unique Kanji characters from query Q'. The quotient of the number of co-occurring Kanji characters and the total unique Kanji characters is calculated, step 730. The 'kanjid' register is set to the value of the difference between one ("1") and the calculated quotient, step 735. The 'kanjid' register value may be used by the similarity score function illustrated in Table A to calculate a similarity score for Q'.
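One reading of the kanjid function of FIG. 7 can be sketched as follows. Two assumptions are made and should be checked against the figure: Kanji are identified as characters in the CJK Unified Ideographs block (U+4E00 through U+9FFF), and the "total number of unique Kanji characters in both Q and Q'" is taken as the number of distinct Kanji across the two queries. If the intended total is instead the sum of each query's unique Kanji, only the `total` line changes.

```python
def is_kanji(ch: str) -> bool:
    """Assumption: a Kanji character is one in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def kanjid(q: str, q_prime: str) -> float:
    """Kanji disagreement between Q and Q'; 0 when either query has no Kanji."""
    kanji_q = {ch for ch in q if is_kanji(ch)}
    kanji_q_prime = {ch for ch in q_prime if is_kanji(ch)}
    if not kanji_q or not kanji_q_prime:
        return 0.0  # step 708: register set to zero
    shared = kanji_q & kanji_q_prime   # step 725: co-occurring unique Kanji
    total = kanji_q | kanji_q_prime    # step 727: read here as distinct Kanji overall
    return 1.0 - len(shared) / len(total)  # steps 730 and 735
```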
[0079] FIG. 8 illustrates one embodiment of a method for identifying the number of characters overlapping in the prefixes of a given query Q, written according to one or more Japanese writing systems, and a query Q' selected from a candidate set of queries, starting with a comparison of the leftmost characters of each query and continuing until the first character disagreement. The method presented in FIG. 8 illustrates one embodiment of the opr function utilized by the similarity score function illustrated in Table A.
[0080] A given query Q, written according to one or more Japanese writing systems, is converted to Roman character form, step 805. A query Q' is selected from a candidate set of queries, step 810. A check is performed to determine whether the one or more characters comprising query Q' are in Roman character form, step 815. If the one or more characters comprising query Q' are not in Roman character form, the characters are converted to Roman characters, step 820. If the characters comprising Q' are already in Roman character form, or after the one or more characters comprising Q' have been converted to Roman character form, the first Roman characters of query Q and query Q' are selected, step 825. [0081] A check is performed to determine whether the first character selected from query Q and the first character selected from query Q' match, step 835. If the first characters selected from Q and Q' do not match, processing terminates, step 830. Alternatively, if the characters selected match, a character match count register is incremented, step 850, indicating that a character match for query Q and query Q' was identified. According to one embodiment of the invention, the character match count register is initialized with the value zero ("0") and incremented as characters from query Q and query Q' are identified as matching. [0082] The next character from Q and Q' are selected, step 840, and a check is performed to determine whether the next characters match, step 835. If the characters selected from Q and Q' do not match, the character match count register is not incremented and processing ends, step 830. When processing terminates, step 830, the value in the character match count register will indicate the number of characters that match in Q and Q'. The value in the character match count register is utilized by the similarity score function illustrated in Table A to calculate a similarity score for query Q'.
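The prefix comparison of FIG. 8 can be sketched as follows, assuming both queries are already in Roman character form. The local variable plays the role of the character match count register; the example queries are illustrative.

```python
def opr(q_roman: str, q_prime_roman: str) -> int:
    """Count matching characters from the left until the first disagreement."""
    match_count = 0  # the register is initialized with the value zero
    for char_q, char_q_prime in zip(q_roman, q_prime_roman):
        if char_q != char_q_prime:
            break  # step 830: processing terminates at the first mismatch
        match_count += 1  # step 850: a character match was identified
    return match_count
```

Because `zip` stops at the end of the shorter string, the count never exceeds the length of the shorter query.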
[0083] FIG. 9 illustrates one embodiment of a method for identifying whether a given query Q, written according to one or more Japanese writing systems, or a query Q', selected from a candidate set of queries, contains non-Roman characters. The embodiment presented in FIG. 9 illustrates the Japanese function used by the similarity score function illustrated in Table A.
[0084] A given query Q, written according to one or more Japanese writing systems, is received, step 905. A check is performed to determine whether query Q contains one or more non-Roman characters, step 910. If query Q contains one or more non-Roman characters, a 'Japanese' register is set to the value one ("1"), step 908. According to one embodiment of the invention, the 'Japanese' register comprises a memory device for storing a given numeric value.
[0085] If query Q does not contain one or more non-Roman characters, a query Q' is selected from a candidate set comprising one or more queries, step 915. A check is performed to determine whether query Q' contains one or more non-Roman characters, step 920. If query Q' contains one or more non-Roman characters, the 'Japanese' register is set to the value one ("1"), step 908. Alternatively, if query Q' contains only Roman characters, the 'Japanese' register is set to the value zero ("0"), step 922, and processing is thereafter terminated, step 925. The value maintained in the 'Japanese' register may be utilized by the similarity score function illustrated in Table A to calculate a similarity score for query Q'.
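The check of FIG. 9 can be sketched as follows. The specification does not define "Roman character" at the code point level; this sketch assumes that ASCII characters (code points up to 127) are Roman and everything else is non-Roman.

```python
def has_non_roman(text: str) -> bool:
    """Assumption: any character outside the ASCII range counts as non-Roman."""
    return any(ord(ch) > 127 for ch in text)


def japanese(q: str, q_prime: str) -> int:
    """Return 1 if either query contains a non-Roman character, else 0."""
    if has_non_roman(q):
        return 1  # steps 910 and 908
    if has_non_roman(q_prime):
        return 1  # steps 920 and 908
    return 0      # step 922: both queries contain only Roman characters
```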
[0086] FIG. 10 illustrates one embodiment of a method for determining the character edit distance between a given query Q and a query Q' after all Kanji characters have been converted to Kana characters and all non-Japanese characters have been removed from each respective query. The method presented in FIG. 10 illustrates one embodiment of the levk function utilized by the similarity score function illustrated in Table A.
[0087] As illustrated in FIG. 10, a given query Q' is selected from a candidate set of queries, step 1005. A check is performed to determine whether query Q' or a given query Q, written according to one or more Japanese writing systems, contains one or more Kanji characters, step 1010. If either query Q or query Q' contains one or more Kanji characters, the Kanji characters in each respective query are converted to Kana characters, step 1015. For example, query Q may be comprised of both Kanji characters and Arabic numerals, such as a Kanji character followed by the Arabic numeral "200." After converting the Kanji character to Kana characters, query Q may comprise Kana characters followed by the Arabic numeral "200."
[0088] If neither query Q nor query Q' contains Kanji characters, or after all Kanji characters in each respective query have been converted to Kana characters, an additional check is performed to determine whether either query contains non-Japanese characters, step 1020. According to one embodiment of the invention, non-Japanese characters comprise characters not written according to one or more Japanese writing systems. For example, if query Q includes Kana characters and Arabic numerals, such as [characters illegible in source], the Arabic numeral "200" may comprise non-Japanese characters.
[0089] If either query Q or query Q' contains non-Japanese characters, the non-Japanese characters are removed, step 1025. With reference to the above-mentioned example, after removing non-Japanese characters from query Q, namely the Arabic numeral "200," query Q may comprise the Kana characters [Kana characters illegible in source]. If neither query Q nor query Q' contains non-Japanese characters, or after all non-Japanese characters have been removed, the character edit distance between Q and Q' is calculated, step 1030. The character edit distance between query Q and query Q' may be used by the similarity score function illustrated in Table A to calculate a similarity score for Q'.
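The steps of FIG. 10 might be sketched as follows. Several points are assumptions rather than details from the text: the Kanji-to-Kana converter is passed in as a caller-supplied function (`kanji_to_kana`, hypothetical) because the conversion method is not specified here; "Japanese characters" remaining after conversion are taken to be the Unicode Hiragana and Katakana blocks (U+3040 to U+30FF), so half-width Katakana and Romaji are dropped; and the character edit distance is taken to be the standard Levenshtein distance.

```python
from typing import Callable


def is_kana(ch: str) -> bool:
    """True for characters in the Unicode Hiragana or Katakana blocks."""
    return "\u3040" <= ch <= "\u30ff"


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def levk(q: str, q_prime: str, kanji_to_kana: Callable[[str], str]) -> int:
    """Convert Kanji to Kana, drop non-Kana characters, then compare."""
    kana_q = "".join(ch for ch in kanji_to_kana(q) if is_kana(ch))         # steps 1015, 1025
    kana_qp = "".join(ch for ch in kanji_to_kana(q_prime) if is_kana(ch))  # steps 1015, 1025
    return edit_distance(kana_q, kana_qp)                                  # step 1030
```

Passing the converter as a parameter keeps the sketch independent of any particular reading dictionary or morphological analyzer; a real system would need one to resolve Kanji readings.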
[0090] FIG. 11 presents one embodiment of the pl2min function utilized by the similarity score function illustrated in Table A. According to one embodiment of the invention, the pl2min function calculates the query substitution probability of a given query Q' following a given query Q, and may also be used to calculate the phrase substitution probability of a phrase P' following a given phrase P. For example, one or more query logs may be maintained identifying the one or more queries and phrases submitted by a given user during a query session. The query logs may identify the order of the one or more queries and phrases submitted by the user, for example, to provide an indication of how the user refined a query Q, how the user rewrote a query Q, how the user utilized one or more alternate writing systems of a language with multiple writing systems to express a query Q, etc. The query logs may further indicate the frequency with which one or more users submitted one or more queries or phrases.
[0091] The frequency with which a given query Q appears in one or more query logs is identified, step 1105. A given query Q' is selected from a candidate set of queries, step 1110. A check is performed to determine whether query Q' follows query Q in any of the one or more query logs, step 1115. According to one embodiment of the invention, a check is performed to determine whether query Q' follows query Q in the query logs for a given user's query session, wherein a query session may comprise the one or more queries submitted by a user during a given time period.
[0092] If query Q' does not follow query Q in any of the one or more query logs, a 'pl2min' register is set to zero ("0"), step 1125, wherein the 'pl2min' register may comprise a memory device for storing a given numeric value. Alternatively, if query Q' is identified as following Q in one or more of the query logs, the frequency with which query Q' follows query Q in the query logs is identified, step 1120. The 'pl2min' register is set to the value of the quotient of the frequency with which query Q' follows query Q in the query logs, and the frequency of query Q in the query logs, step 1140. For example, if query Q appears in the query logs twelve ("12") times and Q' follows query Q seven ("7") times in the query logs, the 'pl2min' register may be set to the value "7/12."
[0093] Those of skill in the art recognize that the functions illustrated in FIGs. 3 through 11, and utilized by the similarity score function illustrated in Table A, are not limited to the Japanese language and may be modified for one or more languages with multiple writing systems. Those of skill in the art further recognize that the similarity score function illustrated in Table A may utilize one or more combinations of the functions illustrated in FIGs. 3 through 11 in order to calculate a similarity score for a given query written according to one or more writing systems of a language with multiple writing systems.
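The pl2min computation of FIG. 11 can be illustrated with a short Python sketch. The data layout is an assumption: each query session is modeled as an ordered list of query strings, and "follows" is read as "immediately follows within the same session". The function name mirrors the text, but the signature is hypothetical.

```python
from typing import Iterable, Sequence


def pl2min(q: str, q_prime: str, sessions: Iterable[Sequence[str]]) -> float:
    """Fraction of occurrences of q that are immediately followed by q_prime."""
    q_count = 0       # frequency of Q in the logs (step 1105)
    follow_count = 0  # frequency with which Q' follows Q (step 1120)
    for session in sessions:
        for i, query in enumerate(session):
            if query != q:
                continue
            q_count += 1
            if i + 1 < len(session) and session[i + 1] == q_prime:
                follow_count += 1
    if q_count == 0 or follow_count == 0:
        return 0.0    # Q' never follows Q: register set to zero (step 1125)
    return follow_count / q_count  # quotient (step 1140)
```

With logs in which Q appears twelve times and Q' follows it seven times, the sketch yields 7/12, matching the example in paragraph [0092].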
[0094] While the invention has been described and illustrated in connection with preferred embodiments, many variations and modifications as will be evident to those skilled in the art may be made without departing from the spirit and scope of the invention, and the invention is thus not to be limited to the precise details of methodology or construction set forth above as such variations and modifications are intended to be included within the scope of the invention.

Claims

We claim:
1. A method for identifying one or more queries related to a given query, the method comprising: receiving a query written according to one or more writing systems of a language with multiple writing systems; identifying a candidate set of queries written according to one or more writing systems of the language with multiple writing systems; and calculating a score for the one or more queries in the candidate set indicating the similarity of the one or more queries with respect to the query received.
2. The method of claim 1 wherein receiving the query comprises receiving a query written according to a combination of one or more Japanese writing systems.
3. The method of claim 1 wherein identifying the candidate set of queries comprises identifying a set of one or more queries related to the query received.
4. The method of claim 3 wherein identifying the candidate set of queries related to the query received comprises identifying one or more queries related to the query received as indicated in one or more query logs.
5. The method of claim 1 wherein receiving the query comprises receiving a query written according to the Japanese Hiragana writing system.
6. The method of claim 1 wherein receiving the query comprises receiving a query written according to the Japanese Katakana writing system.
7. The method of claim 1 wherein receiving the query comprises receiving a query written according to the Japanese Kana writing system.
8. The method of claim 1 wherein receiving the query comprises receiving a query written according to the Japanese Romaji writing system.
9. The method of claim 1 wherein receiving the query comprises receiving a query written according to the Japanese JASCII writing system.
10. The method of claim 1 wherein receiving the query comprises receiving a query written according to the Japanese Kanji writing system.
11. The method of claim 1 wherein receiving the query comprises receiving a set of terms comprising a phrase.
12. The method of claim 1 wherein calculating a score for the one or more queries in the candidate set comprises calculating a score indicating the similarity in meaning of a given query from the candidate set with respect to the received query.
13. The method of claim 1 wherein calculating a score comprises: converting one or more characters of the received query to Roman characters; converting one or more characters of a query selected from the candidate set to Roman characters; and calculating a character edit distance between the received query and the selected query from the candidate set.
14. The method of claim 1 wherein calculating a score comprises: converting one or more characters of the received query to Roman characters; converting one or more characters of a selected query from the candidate set to Roman characters; removing space characters from the received query and the selected query from the candidate set; and calculating a character edit distance between the received query and the selected query from the candidate set.
15. The method of claim 1 wherein calculating a score comprises: converting one or more characters of the received query to Roman characters; converting one or more characters of a query selected from the candidate set to Roman characters; identifying a number of unique space-separated co-occurring words in the received query and the selected query; identifying a total number of unique space-separated words in both the received query and the selected query; calculating a quotient of the number of unique space-separated co-occurring words and the total number of unique space-separated words in both queries; and calculating a difference between the numerical value one ("1") and the calculated quotient.
16. The method of claim 1 wherein calculating a score comprises identifying whether a digit is unique to the received query or a selected query from the candidate set.
17. The method of claim 1 wherein calculating a score comprises: identifying a number of co-occurring Japanese Kanji characters in the received query and a selected query from the candidate set; identifying a total number of unique Japanese Kanji characters in the received query and the selected query from the candidate set; calculating a quotient of the number of co-occurring Japanese Kanji characters and the total number of unique Japanese Kanji characters; and calculating a difference between the numerical value one ("1") and the calculated quotient.
18. The method of claim 1 wherein calculating a score comprises: converting one or more characters of the received query to Roman characters; converting one or more characters of a selected query from the candidate set to Roman characters; and calculating a number of Roman characters the received query and the selected query have in common.
19. The method of claim 1 wherein calculating a score comprises identifying whether either the received query or a selected query from the candidate set contains a non-Roman character.
20. The method of claim 1 wherein calculating a score comprises: converting one or more Japanese Kanji characters of the received query to Japanese Kana characters; converting one or more Japanese Kanji characters of a selected query from the candidate set to Japanese Kana characters; removing all non-Japanese characters from the received query and the selected query from the candidate set; and calculating a character edit distance between the received query and the selected query from the candidate set.
21. The method of claim 1 wherein calculating a score comprises calculating a quotient of the frequency with which a selected query from the candidate set follows the received query in one or more query logs and the frequency of the received query in the one or more query logs.
22. The method of claim 1 comprising selecting one or more of the queries from the candidate set for distribution.
23. The method of claim 22 wherein selecting one or more of the queries from the candidate set for distribution comprises selecting one or more queries with scores exceeding a given threshold.
24. The method of claim 1 comprising distributing the one or more queries from the candidate set with scores exceeding a given threshold.
25. The method of claim 24 wherein distributing the one or more queries from the candidate set comprises embedding the one or more queries in a web page.
26. A system for identifying one or more queries related to a given query, the system comprising: a search engine operative to: receive a query written according to one or more writing systems of a language with multiple writing systems, and identify a candidate set of one or more queries written according to one or more writing systems of the language with multiple writing systems; a conversion component operative to convert the received query and the one or more queries in the candidate set into one or more written formats; and a similarity component operative to calculate a score for the one or more queries in the candidate set indicating the similarity of the one or more queries with respect to the received query.
27. The system of claim 26 wherein the search engine is operative to receive a query written according to one or more Japanese writing systems.
28. The system of claim 26 wherein the search engine is operative to identify a candidate set comprised of one or more queries related to the received query.
29. The system of claim 28 wherein the search engine is operative to search one or more query logs to identify one or more queries related to the received query.
30. The system of claim 26 wherein the conversion component is operative to convert a query into one or more written formats in accordance with one or more writing systems.
31. The system of claim 26 wherein the similarity component is operative to calculate a score indicating the similarity in meaning of a selected query from the candidate set with respect to the received query.
32. The system of claim 26 wherein the similarity component is operative to calculate a character edit distance between the received query and a selected query from the candidate set.
33. The system of claim 26 wherein the similarity component is operative to: identify a number of unique space-separated co-occurring words in the received query and the selected query; identify a total number of unique space-separated words in both the received query and the selected query; calculate a quotient of the number of unique space-separated co-occurring words and the total number of unique space-separated words in both queries; and calculate a difference between the numerical value one ("1") and the calculated quotient.
34. The system of claim 26 wherein the similarity component is operative to identify whether a digit is unique to the received query or a selected query from the candidate set.
35. The system of claim 26 wherein the similarity component is operative to: identify a number of co-occurring Japanese Kanji characters in the received query and a selected query from the candidate set; identify a total number of unique Japanese Kanji characters in the received query and the selected query from the candidate set; calculate a quotient of the number of co-occurring Japanese Kanji characters and the total number of unique Japanese Kanji characters; and calculate a difference between the numerical value one ("1") and the calculated quotient.
36. The system of claim 26 wherein the similarity component is operative to calculate a number of characters the received query and a selected query from the candidate set have in common.
37. The system of claim 26 wherein the similarity component is operative to identify whether the received query or a selected query from the candidate set contains one or more characters of a given writing system.
38. The system of claim 26 wherein the similarity component is operative to calculate a quotient of the frequency with which a selected query from the candidate set follows the received query in one or more query logs and the frequency of the received query in the one or more query logs.
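By way of illustration only, the "one minus quotient" computations recited for space-separated words and for Kanji characters may be sketched as a Jaccard-style distance. This sketch rests on stated assumptions: "total number of unique" items in both queries is read as the size of the set union; romanization of the queries is assumed to have happened upstream and is omitted; Kanji are approximated by the Unicode CJK Unified Ideographs block (U+4E00 to U+9FFF); and all function names are hypothetical.

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 minus (shared items / total unique items across both sets)."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def word_distance(q: str, q_prime: str) -> float:
    """Distance over unique space-separated words of the two queries."""
    return jaccard_distance(set(q.split()), set(q_prime.split()))


def is_kanji(ch: str) -> bool:
    """True for characters in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def kanji_distance(q: str, q_prime: str) -> float:
    """Distance over unique Kanji characters of the two queries."""
    return jaccard_distance({c for c in q if is_kanji(c)},
                            {c for c in q_prime if is_kanji(c)})
```

Under this reading, identical word sets give a distance of zero and disjoint sets give one, so smaller values indicate closer queries.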
PCT/US2007/062876 2006-02-28 2007-02-27 System and method for identifying related queries for languages with multiple writing systems WO2007101194A2 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
JP2008557464A JP2009528636A (en) 2006-02-28 2007-02-27 System and method for identifying related queries for languages with multiple writing systems
CN200780006965XA CN101390097B (en) 2006-02-28 2007-02-27 System and method for identifying related queries for languages with multiple writing systems
EP07757547A EP1929415A4 (en) 2006-02-28 2007-02-27 System and method for identifying related queries for languages with multiple writing systems
HK09108573.9A HK1130912A1 (en) 2006-02-28 2009-09-18 System and method for identifying related queries for languages with multiple writing systems

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US11/365,315 2006-02-28
US11/365,315 US7689554B2 (en) 2006-02-28 2006-02-28 System and method for identifying related queries for languages with multiple writing systems

Publications (2)

Publication Number Publication Date
WO2007101194A2 true WO2007101194A2 (en) 2007-09-07
WO2007101194A3 WO2007101194A3 (en) 2008-03-13

Family

ID=38445252

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2007/062876 WO2007101194A2 (en) 2006-02-28 2007-02-27 System and method for identifying related queries for languages with multiple writing systems

Country Status (7)

Country Link
US (2) US7689554B2 (en)
EP (2) EP1929415A4 (en)
JP (1) JP2009528636A (en)
KR (1) KR101098703B1 (en)
CN (2) CN102750323B (en)
HK (2) HK1130912A1 (en)
WO (1) WO2007101194A2 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040261021A1 (en) 2000-07-06 2004-12-23 Google Inc., A Delaware Corporation Systems and methods for searching using queries written in a different character-set and/or language from the target pages

Family Cites Families (55)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4833610A (en) * 1986-12-16 1989-05-23 International Business Machines Corporation Morphological/phonetic method for ranking word similarities
JP2809341B2 (en) * 1994-11-18 1998-10-08 松下電器産業株式会社 Information summarizing method, information summarizing device, weighting method, and teletext receiving device.
WO1997008604A2 (en) * 1995-08-16 1997-03-06 Syracuse University Multilingual document retrieval system and method using semantic vector matching
US5778361A (en) * 1995-09-29 1998-07-07 Microsoft Corporation Method and system for fast indexing and searching of text in compound-word languages
NZ506229A (en) * 1998-03-03 2003-02-28 Amazon Identifying the items most relevant to a current query based on items selected in connection with similar queries
US20050119875A1 (en) * 1998-03-25 2005-06-02 Shaefer Leonard Jr. Identifying related names
US6493709B1 (en) * 1998-07-31 2002-12-10 The Regents Of The University Of California Method and apparatus for digitally shredding similar documents within large document sets in a data processing environment
US6876997B1 (en) * 2000-05-22 2005-04-05 Overture Services, Inc. Method and apparatus for indentifying related searches in a database search system
JP2001337980A (en) * 2000-05-29 2001-12-07 Sony Corp Electronic program guide retrieving method and electronic program guide retrieving device
US6999932B1 (en) * 2000-10-10 2006-02-14 Intel Corporation Language independent voice-based search system
TW476895B (en) * 2000-11-02 2002-02-21 Semcity Technology Corp Natural language inquiry system and method
CA2431341A1 (en) * 2000-12-12 2002-06-20 Time Warner Entertainment Company, L.P. Digital asset data type definitions
US6892377B1 (en) * 2000-12-21 2005-05-10 Vignette Corporation Method and system for platform-independent file system interaction
US20020165717A1 (en) * 2001-04-06 2002-11-07 Solmer Robert P. Efficient method for information extraction
US7293014B2 (en) * 2001-06-18 2007-11-06 Siebel Systems, Inc. System and method to enable searching across multiple databases and files using a single search
US7051119B2 (en) * 2001-07-12 2006-05-23 Yahoo! Inc. Method and system for enabling a script on a first computer to communicate and exchange data with a script on a second computer over a network
US7403938B2 (en) * 2001-09-24 2008-07-22 Iac Search & Media, Inc. Natural language query processing
US20030065650A1 (en) * 2001-10-03 2003-04-03 Annand Ritchie I. Method and query application tool for searching hierarchical databases
US7149732B2 (en) * 2001-10-12 2006-12-12 Microsoft Corporation Clustering web queries
JP2003296443A (en) * 2002-03-29 2003-10-17 Konica Corp Medical image pick-up device, display control method, and program
US20070208698A1 (en) * 2002-06-07 2007-09-06 Dougal Brindley Avoiding duplicate service requests
JP2004280259A (en) * 2003-03-13 2004-10-07 National Institute Of Information & Communication Technology Search device
US6947930B2 (en) * 2003-03-21 2005-09-20 Overture Services, Inc. Systems and methods for interactive search query refinement
US7051023B2 (en) * 2003-04-04 2006-05-23 Yahoo! Inc. Systems and methods for generating concept units from search queries
CN100485603C (en) * 2003-04-04 2009-05-06 雅虎公司 Systems and methods for generating concept units from search queries
EP1611534A4 (en) * 2003-04-04 2010-02-03 Yahoo Inc A system for generating search results including searching by subdomain hints and providing sponsored results by subdomain
US7051014B2 (en) * 2003-06-18 2006-05-23 Microsoft Corporation Utilizing information redundancy to improve text searches
US20040260681A1 (en) * 2003-06-19 2004-12-23 Dvorak Joseph L. Method and system for selectively retrieving text strings
US7346629B2 (en) * 2003-10-09 2008-03-18 Yahoo! Inc. Systems and methods for search processing using superunits
CA2542823C (en) * 2003-10-21 2015-05-26 Suntory Limited Method for prognostic evaluation of carcinoma using anti-p-lap antibody
US7240049B2 (en) * 2003-11-12 2007-07-03 Yahoo! Inc. Systems and methods for search query processing using trend analysis
US20050210008A1 (en) * 2004-03-18 2005-09-22 Bao Tran Systems and methods for analyzing documents over a network
US7523102B2 (en) * 2004-06-12 2009-04-21 Getty Images, Inc. Content search in complex language, such as Japanese
JP4936650B2 (en) * 2004-07-26 2012-05-23 ヤフー株式会社 Similar word search device, method thereof, program thereof, and information search device
JP4719684B2 (en) * 2004-09-07 2011-07-06 インターマン株式会社 Information search providing apparatus and information search providing system
US20060106769A1 (en) * 2004-11-12 2006-05-18 Gibbs Kevin A Method and system for autocompletion for languages having ideographs and phonetic characters
US7428533B2 (en) * 2004-12-06 2008-09-23 Yahoo! Inc. Automatic generation of taxonomies for categorizing queries and search query processing using taxonomies
US7707201B2 (en) * 2004-12-06 2010-04-27 Yahoo! Inc. Systems and methods for managing and using multiple concept networks for assisted search processing
US7620628B2 (en) * 2004-12-06 2009-11-17 Yahoo! Inc. Search processing with automatic categorization of queries
US20060161520A1 (en) * 2005-01-14 2006-07-20 Microsoft Corporation System and method for generating alternative search terms
JP2006201907A (en) * 2005-01-19 2006-08-03 Konica Minolta Holdings Inc Update detecting device
US7574436B2 (en) * 2005-03-10 2009-08-11 Yahoo! Inc. Reranking and increasing the relevance of the results of Internet searches
US7668808B2 (en) * 2005-03-10 2010-02-23 Yahoo! Inc. System for modifying queries before presentation to a sponsored search generator or other matching system where modifications improve coverage without a corresponding reduction in relevance
US7634462B2 (en) * 2005-08-10 2009-12-15 Yahoo! Inc. System and method for determining alternate search queries
US7752220B2 (en) * 2005-08-10 2010-07-06 Yahoo! Inc. Alternative search query processing in a term bidding system
US7725464B2 (en) * 2005-09-27 2010-05-25 Looksmart, Ltd. Collection and delivery of internet ads
JP5016610B2 (en) * 2005-12-21 2012-09-05 ディジマーク コーポレイション Rule-driven pan ID metadata routing system and network
US7689554B2 (en) * 2006-02-28 2010-03-30 Yahoo! Inc. System and method for identifying related queries for languages with multiple writing systems
US7571162B2 (en) * 2006-03-01 2009-08-04 Microsoft Corporation Comparative web search
US8005816B2 (en) * 2006-03-01 2011-08-23 Oracle International Corporation Auto generation of suggested links in a search system
US8868540B2 (en) * 2006-03-01 2014-10-21 Oracle International Corporation Method for suggesting web links and alternate terms for matching search queries
US20070208702A1 (en) * 2006-03-02 2007-09-06 Morris Robert P Method and system for delivering published information associated with a tuple using a pub/sub protocol
US7599931B2 (en) * 2006-03-03 2009-10-06 Microsoft Corporation Web forum crawler
US8832097B2 (en) * 2006-03-06 2014-09-09 Yahoo! Inc. Vertical search expansion, disambiguation, and optimization of search queries
US20070208704A1 (en) * 2006-03-06 2007-09-06 Stephen Ives Packaged mobile search results

Also Published As

Publication number Publication date
EP1929415A4 (en) 2011-06-15
HK1176711A1 (en) 2013-08-02
KR101098703B1 (en) 2011-12-23
CN102750323B (en) 2016-05-11
EP3301591A1 (en) 2018-04-04
US20080077588A1 (en) 2008-03-27
EP1929415A2 (en) 2008-06-11
CN101390097A (en) 2009-03-18
US20070203894A1 (en) 2007-08-30
WO2007101194A3 (en) 2008-03-13
CN101390097B (en) 2012-07-04
JP2009528636A (en) 2009-08-06
HK1130912A1 (en) 2010-01-08
US7689554B2 (en) 2010-03-30
CN102750323A (en) 2012-10-24
KR20080114764A (en) 2008-12-31

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
REEP Request for entry into the european phase

Ref document number: 2007757547

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2007757547

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 200780006965.X

Country of ref document: CN

Ref document number: 2008557464

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: KR