KR100508353B1

KR100508353B1 - Method of spell-checking search queries

Info

Publication number: KR100508353B1
Application number: KR10-2003-0085467A
Authority: KR
Inventors: 사저노암
Original assignee: 구글 잉크.
Priority date: 2003-11-28
Filing date: 2003-11-28
Publication date: 2005-08-17
Also published as: KR20050051811A

Abstract

A computer-implemented method is provided for determining whether a target text-string is spelled correctly. The target text-string is compared with a corpus to determine a set of contexts, each of which contains an occurrence of the target text-string. Using heuristics, each context in the set is characterized on the basis of occurrences of the target text-string and of a reference text-string within the corpus. A context is characterized as including a correct spelling of the target text-string, as including an incorrect spelling of the reference text-string, or as including an indeterminate usage of the target text-string. The likelihood that the target text-string is a misspelling of the reference text-string is calculated as a function of the quantity of contexts including a correct spelling of the target text-string and the quantity of contexts including an incorrect spelling of the reference text-string. In one application, the target text-string is received as part of a search query, and the search is performed after the spell check.

Description

METHOD OF SPELL-CHECKING SEARCH QUERIES

The present invention relates generally to retrieving data from a data communications network and, more particularly, to computer-implemented spell-checking techniques for search engine query text strings.

The World Wide Web (the "Web") contains a vast amount of information in the form of loosely organized hyperlinked documents (for example, web pages) that are accessed through a data communications network (the "Internet"). One reason the number of hyperlinked documents on the Web has grown so explosively is that anyone can upload hyperlinked documents, and those documents can contain links to other hyperlinked documents. Because the web pages available through the Internet are unstructured and vast in number, it is difficult to find and retrieve relevant information efficiently while avoiding irrelevant information.

One conventional way of sifting through information on a computer network (for example, the Internet) is to use a search engine. A user begins a search for relevant information with the search engine, and the search engine attempts to return relevant information in response to the user's request. The request usually takes the form of a query (for example, a set of words related to the desired topic). The search engine returns a number of links to web pages together with brief descriptions of those pages. Because the number of pages on the Web is so large, ensuring that the returned pages relate to a topic matching the user's intent is a central problem in web search. Perhaps the simplest and most widespread way of searching the Web is to search for web pages that contain, or are associated with, all or many of the words in the query. Such a method is generally called text-based searching. Text-based searching on the Web can be very inaccurate, and several problems can arise in its processing.

Searching the Internet for narrowly defined relevant information is like looking for a "needle" of relevant information in a "haystack" of all the information available through the data network. The efficiency of the search process depends largely on the quality of the search. Often a large number of web pages match a user's query. The query results are generally ranked for display according to predetermined methods or criteria, so that the user is directed to the information believed to be most relevant. Poor-quality queries misdirect the search process, interfere with the ranking algorithm, and generally produce poor search results. Inefficient Internet search methods also tend to slow the data network as a whole: requests for irrelevant web pages occupy web page servers, and the transmission of irrelevant web page information clogs data network paths.

As the Internet continues to grow, there is an increasing need for improved techniques for searching hyperlinked documents efficiently.

The present invention is amenable to various modifications and alternative forms; specific forms of the invention are shown by way of example in the drawings and will be described in more detail. It should be understood, however, that the intention is not to limit the invention to the particular embodiments described. On the contrary, the invention covers all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.

The present invention is directed to a computer-implemented method of checking the spelling of text using heuristics. The invention is exemplified in a number of implementations and applications, some of which are summarized below.

According to an exemplary embodiment of the present invention, a computer-implemented application includes a method for detecting spelling errors in a target text-string, such as a word or phrase. The target text-string is compared with a database or corpus to determine a set of contexts, each of which contains an occurrence of the target text-string. Each context in the set is characterized as including a correct spelling of the target text-string, as including an incorrect spelling of a reference text-string (for example, another word or phrase), or as including an indeterminate usage of the target text-string in the context. The likelihood that the target text-string is a misspelling of the reference text-string is then calculated as a function of the quantity of contexts including a correct spelling of the target text-string and the quantity of contexts including an incorrect spelling of the reference text-string. In another particular implementation of the invention, the probability that the target text-string is misspelled is calculated as the ratio of the quantity of contexts including a correct spelling of the target text-string to the quantity of non-indeterminate contexts.

According to another general embodiment of the present invention, a computer-implemented application detects spelling errors in a target text-string, such as a word or phrase. The target text-string is compared with a database of contexts to determine, from the comparison, a set of potentially corresponding contexts. Each context in the set that has an "occurrence of the target text-string" is characterized as including a correct spelling of the target text-string, as including an incorrect spelling of a reference text-string, or as being an indeterminate context. According to the invention, the computer application uses a quantification of each characterization to calculate the likelihood that the target text-string is misspelled. For example, let X be the quantity of contexts including a correct spelling of the target text-string, Y the quantity of contexts including an incorrect spelling of the reference text-string, and Z the quantity of indeterminate contexts. The likelihood that the target text-string is a misspelling of the reference text-string is then calculated as a function of one of X and Y relative to the sum of X and Y. In a typical implementation of the invention, X, Y and Z are each positive integers. In another implementation, the likelihood calculation does not include Z.
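The X, Y and Z quantities described above can be turned into the two candidate ratios with a few lines of code. The following is a minimal sketch, not the patented implementation: the function name `context_shares` and the returned keys are hypothetical, and because the text only says the likelihood is a function of "one of X and Y" relative to X + Y, both shares are returned rather than one being singled out.

```python
from typing import Dict, Optional


def context_shares(x: int, y: int, z: int = 0) -> Dict[str, Optional[float]]:
    """Return X/(X+Y) and Y/(X+Y) for the context counts described in the text.

    x: contexts characterized as a correct spelling of the target text-string
    y: contexts characterized as an incorrect spelling of the reference text-string
    z: indeterminate contexts (accepted for completeness, deliberately unused,
       matching the implementation in which the calculation does not include Z)
    """
    decided = x + y  # the non-indeterminate contexts
    if decided == 0:
        # Assumption: with no decided contexts, no ratio can be formed.
        return {"target_correct_share": None, "reference_misspelled_share": None}
    return {
        "target_correct_share": x / decided,
        "reference_misspelled_share": y / decided,
    }


shares = context_shares(x=3, y=1, z=6)
print(shares)
```

How either share is mapped onto a final likelihood, and what threshold triggers a correction, is left open here because the text does not specify it.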

According to another aspect of the present invention, heuristics are applied to characterize the contexts, the heuristics being a function of the occurrences of the target text-string and of the reference text-string in the context. If the occurrences of the reference text-string in the context are equal to or greater than a predetermined minimum quantity threshold (for example, 1), and the ratio of reference text-string occurrences to target text-string occurrences in the context is equal to or greater than a predetermined ratio threshold, the context is characterized as including an incorrect spelling of the reference text-string. If the occurrences of the target text-string in the context are equal to or greater than a second predetermined quantity threshold (for example, 1), and the ratio of target text-string occurrences to reference text-string occurrences in the context is equal to or greater than a second predetermined ratio threshold, the context is characterized as including a correct spelling of the target text-string. A context that is characterized neither as correctly spelled nor as incorrectly spelled is indeterminate.
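The two-threshold rule above maps naturally onto a small classification function. The sketch below is illustrative only: the names `Label` and `characterize` are hypothetical, and the ratio threshold of 10.0 is an assumed value, since the text gives an example only for the quantity thresholds (1). The ratio tests are written as multiplications so that a zero count does not cause a division by zero.

```python
from enum import Enum


class Label(Enum):
    TARGET_CORRECT = "correct spelling of target"
    REFERENCE_MISSPELLED = "incorrect spelling of reference"
    INDETERMINATE = "indeterminate"


def characterize(
    target_count: int,
    reference_count: int,
    min_reference: int = 1,            # first quantity threshold (example value from the text)
    reference_ratio: float = 10.0,     # first ratio threshold (assumed value)
    min_target: int = 1,               # second quantity threshold (example value from the text)
    target_ratio: float = 10.0,        # second ratio threshold (assumed value)
) -> Label:
    """Characterize one context from the occurrence counts of both strings in it."""
    # The reference string dominates this context: the target is probably a
    # misspelling of the reference here.
    if reference_count >= min_reference and reference_count >= reference_ratio * target_count:
        return Label.REFERENCE_MISSPELLED
    # The target string dominates this context: the target is probably spelled
    # correctly here.
    if target_count >= min_target and target_count >= target_ratio * reference_count:
        return Label.TARGET_CORRECT
    return Label.INDETERMINATE


print(characterize(target_count=0, reference_count=50))
print(characterize(target_count=40, reference_count=2))
print(characterize(target_count=5, reference_count=5))
```

Counting the labels over all contexts in the set yields the X, Y and Z quantities used in the likelihood calculation.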

According to another embodiment of the present invention, a computer-implemented search engine application detects spelling errors in a target text-string included in a received search query.

In another embodiment of the present invention, a method is provided for detecting a spelling error in a target text-string by: selecting a reference text-string having characteristics corresponding to the target text-string; calculating a first ratio of occurrences of the reference text-string to occurrences of the target text-string in a first database; calculating a second ratio of occurrences of the reference text-string to occurrences of the target text-string in a second database; and determining the likelihood that the target text-string is misspelled as a function of the first ratio relative to the second ratio. Each of the first and second databases is a corpus containing naturally occurring text that is similar in content to the other and to the text being examined. The second database, however, contains fewer spelling errors than the first database. In one embodiment of the invention, the characteristics corresponding to the target text-string include the edit distance between the target text-string and the reference text-string, and the edit distance between the target text-string and a previously identified misspelling of the reference text-string. Edit distance is a measure of the difference between two strings, for example the number of operations (each operation being the insertion or deletion of a character) needed to transform one string into the other.
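The edit distance described above, counted in single-character insertions and deletions only, can be computed with a standard dynamic-programming table. This is a minimal sketch; the function name `indel_distance` is hypothetical, and the text does not say which edit-distance variant or cutoff the method actually uses when selecting reference text-strings.

```python
def indel_distance(a: str, b: str) -> int:
    """Number of single-character insertions and deletions that turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                # Matching characters cost nothing.
                curr.append(prev[j - 1])
            else:
                # Either delete ca from a, or insert cb into a.
                curr.append(1 + min(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(indel_distance("rwd", "red"))
print(indel_distance("silber", "silver"))
```

Under this variant a substituted letter costs two operations (one deletion plus one insertion), so "rwd" and "red" are at distance 2.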

According to another embodiment of the present invention, the target text-string is compared with a database or corpus to determine a set of contexts, each of which contains an occurrence of the target text-string. Using the corpus, and then using a better-spelled corpus, each context in the set is characterized as including a correct spelling of the target text-string, as including an incorrect spelling of a reference text-string (for example, another word or phrase), or as including an indeterminate usage of the target text-string in the context. A first ratio of occurrences of the reference text-string to occurrences of the target text-string in the first database is calculated. A second ratio of occurrences of the reference text-string to occurrences of the target text-string in the second database is calculated. Using these calculations, this embodiment provides the likelihood that the target text-string is misspelled as a function of the first ratio and the second ratio. According to another aspect, the target text-string is received as part of a search query to a computer-implemented data network search engine.

According to another embodiment of the present invention, web page information is controlled in response to a user query identifying a first target web page. Each text-string of the user query is spell-checked. The correctly spelled search query identifies a second target web page. A database is searched to determine whether the second target web page corresponds to at least one destination web page. In response to the second target web page corresponding to the at least one destination web page, link information is provided, together with ambient information relevant to evaluating the link, so that the user can access the destination web page.

The above summary of the present invention is not intended to describe every implementation or embodiment of the invention. The drawings and the detailed description illustrate these embodiments more clearly.

(Detailed Description)

The present invention is applicable to a variety of text pattern recognition methods, including computer-implemented spell-checking applications in word processing, speech recognition and transcription, and text-manipulation programs. The invention has been found to be particularly suitable for computer-implemented information search and retrieval applications, such as data network search engine applications. Although the invention is not limited to such search engine spell-checking applications, its various aspects will be appreciated through a discussion of examples in that context.

An enormous amount of digital information is conveyed to people as text displayed on a monitor connected to a digital processor. Computer-implemented spell-checking routines are therefore increasingly in demand as a means of identifying potential spelling errors in text. Words are the material of written and spoken language, the means by which ideas are conveyed. The letters of the alphabet are a predefined set of characters used as phonetic symbols. A particular arrangement of letters is recognized as a word by some accepted authority, and one or more meanings (for example, ideas) are associated with that arrangement. The particular letters, together with their particular order within a string of text, are the important features that distinguish words with different meanings from one another. In general, an authority (for example, a dictionary publisher) tabulates the recognized words and their associated meanings.

Like-sounding words can have different spellings and different meanings. In English, the context in which a particular word is used can affect the meaning and/or pronunciation of the word. A language includes a set of rules, such as grammar, that support a common understanding of word usage. The rules may be formal, or they may be very informal, as with slang. In English, words are delimited by spaces in written text and by pauses between words in spoken text. The context of a particular word (for example, a target word) is the words adjacent to or near that target word.
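The notion of context as the words adjacent to or near a target word can be made concrete with a short sketch. The function name `contexts_of`, the whitespace tokenization, and the `window` parameter (how many neighboring words count as "near") are all assumptions made for illustration; the text does not fix a window size or a tokenization scheme.

```python
from typing import List, Tuple

Context = Tuple[Tuple[str, ...], Tuple[str, ...]]


def contexts_of(target: str, text: str, window: int = 1) -> List[Context]:
    """Return (left words, right words) for every occurrence of target in text."""
    words = text.lower().split()  # words are delimited by spaces in written text
    found: List[Context] = []
    for i, word in enumerate(words):
        if word == target:
            left = tuple(words[max(0, i - window):i])
            right = tuple(words[i + 1:i + 1 + window])
            found.append((left, right))
    return found


print(contexts_of("red", "little red wagons and a red barn"))
```

With `window=1`, the two occurrences of "red" above yield the contexts ("little", "wagons") and ("a", "barn").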

Spelling errors in text-strings occur for a number of reasons. A string of letters (or text-string) that is presented as a word but is not included in the tabulation of recognized words (for example, "silber" instead of "silver") is likely to be a misspelling of another word. An unrecognized text-string may, however, be a new word that has an associated meaning but has not yet been included in the tabulation of recognized words, or it may be a proper name referring to a specific person, place or thing. Unrecognized text-strings are often very similar to recognized words; they may be, for example, words with one or more additional letters, omitted letters, transposed letters, or substituted letters. A typing error in which the user presses the wrong key is one example of a spelling error caused by a substituted letter within a word.

English includes phonetically similar words that are pronounced alike but spelled differently (for example, "blue" and "blew"). Sometimes a phonetically identical word is used in a particular context, intentionally but incorrectly, in place of the word having the intended meaning. The incorrect usage of a correctly spelled word is regarded as another form of spelling error. Such a "misspelled" word is present in the tabulation of recognized words and can therefore be detected only from the context in which it is used, where the meaning of the "misspelled" word is inconsistent with the meaning conveyed by the surrounding words. As the amount of information in digital form grows with the increased use of computer-implemented processes, ideas are more and more often rendered in written form (for example, as words) as an intermediate step before being digitized for communication and/or storage.

Text-string spelling errors lead to poor-quality information search queries, and consequently to poor-quality search results. Spelling errors include both misspelled words and the incorrect usage of correctly spelled words. For example, a user who wants to search a computer-implemented data network such as the Internet for "little red wagons" would ideally run a search on the query "little red wagons" through a search engine application. The user may, however, mistakenly enter the query "little rwd wagons" into the search engine. Clearly, "rwd" is a misspelling of the intended word "red". The spelling error results from a typing error that occurs because the "w" key is next to the "e" key on the computer keyboard used to enter the query. The misspelled word is not recognized as a word, and a conventional search engine will typically return results that include an attempt to find web pages related to the query text-string "rwd". The search is led further astray because the search engine makes no attempt to find web pages associated with the intended word "red".

Search query spelling errors caused by the incorrect usage of correctly spelled words are exemplified by search queries such as "little bed wagons" or the phonetically correct "little read wagons". Each of these queries contains a word that is spelled correctly but used incorrectly. The word between "little" and "wagons" is not the correct spelling for conveying the intended meaning of the word group. Errors of this kind in a search query cannot be detected simply by determining whether each query text-string is recognized as a word, for example by being included in a list or look-up table of recognized words. They can be detected by considering each text-string (for example, a word or phrase) within the context in which it is used, as a word subject to the established rules of word usage in the language.
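The limitation described above is easy to see with a toy look-up table. In this sketch the word list `RECOGNIZED` and the function `unrecognized_words` are hypothetical stand-ins for a real tabulation of recognized words; the point is only that a pure membership test flags "rwd" but passes "bed" and "read".

```python
from typing import List, Set

# Hypothetical, deliberately tiny table of recognized words.
RECOGNIZED: Set[str] = {"little", "red", "bed", "read", "wagons"}


def unrecognized_words(query: str, lexicon: Set[str] = RECOGNIZED) -> List[str]:
    """Return the query words that are absent from the look-up table."""
    return [word for word in query.lower().split() if word not in lexicon]


print(unrecognized_words("little rwd wagons"))   # the typo is caught
print(unrecognized_words("little bed wagons"))   # nothing is flagged
print(unrecognized_words("little read wagons"))  # nothing is flagged
```

Because the last two queries produce no flagged words, catching them requires the context-based characterization described in the following paragraphs.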

In one embodiment of the present invention, a computer-implemented application detects spelling errors in a target text-string that is a word or phrase. The target text-string is compared with a database or corpus to determine a set of contexts, each of which contains an occurrence of the target text-string. Each context in the set is characterized as including a correct spelling of the target text-string, as including an incorrect spelling of a reference text-string (for example, another word or phrase), or as including an indeterminate usage of the target text-string in the context. The likelihood that the target text-string is a misspelling of the reference text-string is then calculated as a function of the quantity of contexts including a correct spelling of the target text-string and the quantity of contexts including an incorrect spelling of the reference text-string. In another embodiment of the invention, the probability that the target text-string is misspelled is calculated as the ratio of the quantity of contexts including a correct spelling of the target text-string to the quantity of non-indeterminate contexts.

According to another embodiment of the present invention, a computer-implemented application detects spelling errors in a target text-string that is a word or phrase. The target text-string is compared with a database of contexts to determine, by the comparison, a set of contexts having an occurrence of the target text-string. According to the invention, the computer application calculates the likelihood that the target text-string is misspelled by counting the quantity of each characterization. For example, let X be the quantity of contexts including a correct spelling of the target text-string, Y the quantity of contexts including an incorrect spelling of the reference text-string, and Z the quantity of indeterminate contexts; the likelihood that the target text-string is a misspelling of the reference text-string is then calculated as a function of X relative to the sum of X and Y. According to a more specific embodiment of the invention, the calculation is a function of X relative to the sum of X and Y where X, Y and Z are each positive integers.

According to another aspect of the present invention, heuristics are applied to characterize the contexts, the heuristics being a function of the occurrences of the target text-string and of the reference text-string in the context. If the occurrences of the reference text-string in the context are equal to or greater than a predetermined minimum quantity threshold (for example, 1), and the ratio of reference text-string occurrences to target text-string occurrences in the context is equal to or greater than a predetermined ratio threshold, the context is characterized as including an incorrect spelling of the reference text-string. If the occurrences of the target text-string in the context are equal to or greater than a second predetermined quantity threshold (for example, 1), and the ratio of target text-string occurrences to reference text-string occurrences in the context is equal to or greater than a second predetermined ratio threshold, the context is characterized as including a correct spelling of the target text-string. A context that cannot be characterized as either correctly spelled or misspelled is classified or labeled as "indeterminate".

According to another general embodiment of the present invention, a computer-implemented search engine application detects spelling errors in a target text-string contained in a received search query.

According to another general embodiment of the present invention, a spelling error is detected in a target text-string by: selecting a reference text-string having characteristics corresponding to the target text-string; calculating a first ratio of occurrences of the reference text-string to occurrences of the target text-string in a first database; calculating a second ratio of occurrences of the reference text-string to occurrences of the target text-string in a second database; and determining the likelihood that the target text-string is misspelled as a function of the first ratio relative to the second ratio. Each of the first and second databases is a corpus of naturally occurring text, the two being similar in content to each other and to the text being examined. The second database, however, contains fewer spelling errors than the first database.

According to another embodiment of the present invention, the target text-string is compared with a database or corpus to determine a set of contexts, each of which contains occurrences of the target text-string. Using first the corpus and second a better-spelled corpus, each context in the set is characterized as containing a correct spelling of the target text-string, as containing an incorrect spelling of a reference text-string (e.g., another word or phrase), or as containing an indeterminate use of the target text-string. A first ratio of occurrences of the reference text-string to occurrences of the target text-string in the first database is calculated. A second ratio of occurrences of the reference text-string to occurrences of the target text-string in the second database is calculated. The likelihood that the target text-string is misspelled is then determined as a function of the first ratio and the second ratio. According to another aspect, the target text-string is received as part of a search query to a computer-implemented data network search engine.

According to another general embodiment of the present invention, web page information is controlled in response to a user query identifying a first target web page. Each text-string of the user query is spell-checked. The resulting correctly spelled search query identifies a second target web page. A database is searched to determine whether the second target web page corresponds to at least one destination web page. In response to the second target web page corresponding to the at least one destination web page, link information is provided, together with surrounding information relevant to evaluating the link, so that the user can access the destination web page.

In another embodiment, the present invention relates to a process for estimating the probability that "bad_word", a random instance of a given text-string (e.g., a word or phrase), is a misspelling of a reference text-string. The reference text-string, "good_word", is another phrase, word, or portion thereof. The probability is denoted by the shorthand P_Misspell(bad_word, good_word). bad_word is a text-string extracted from the text being checked for spelling errors. The process draws on a large corpus of naturally occurring text that is similar to the text being examined in content and in patterns of misspelling. The method according to the invention requires no manual tagging of, or intervention in, the corpus.

According to one aspect of the invention, the occurrences of bad_word in the corpus are divided into a set of contexts. A context contains at least one occurrence of bad_word and is defined on the basis of words located adjacent to or near the occurrence of bad_word. For each of these contexts, heuristics are applied to determine whether the collective occurrences of bad_word in the context are misspellings of good_word, are correct spellings of bad_word, or cannot be judged because there is insufficient information to tell whether bad_word is correct or incorrect in the context (i.e., the context is indeterminate).

The probability P_Misspell(bad_word, good_word) is estimated as the ratio of the number of instances of bad_word in the contexts (of the context set) characterized as containing a misspelling of good_word to the total number of instances of bad_word in the contexts (of the context set) not characterized as indeterminate (i.e., contexts containing either a misspelling of good_word or a correct spelling of bad_word). Alternatively, P_Misspell(bad_word, good_word) is estimated as 1 minus the ratio of the number of instances of bad_word in contexts characterized as containing a correct spelling of bad_word to the total number of instances of bad_word in the contexts of the context set not characterized as indeterminate.
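The two estimates described above give the same value, which a short sketch makes easy to check. The function name, the string labels, and the choice to return None when every context is indeterminate are assumptions made for this illustration; the sample counts in the usage note are invented.

```python
def p_misspell(contexts):
    """Estimate P_Misspell(bad_word, good_word) from labeled contexts.

    contexts: list of (label, f_bad) pairs, where label is one of
    "misspelling", "correct", or "indeterminate" (hypothetical labels)
    and f_bad is the number of bad_word instances in that context.
    """
    contexts = list(contexts)
    misspelled = sum(n for label, n in contexts if label == "misspelling")
    correct = sum(n for label, n in contexts if label == "correct")
    decided = misspelled + correct  # indeterminate contexts are ignored
    if decided == 0:
        return None  # assumption: no estimate without decided contexts
    return misspelled / decided


def p_misspell_alternative(contexts):
    """The alternative form: 1 minus the share of correctly spelled instances."""
    contexts = list(contexts)
    misspelled = sum(n for label, n in contexts if label == "misspelling")
    correct = sum(n for label, n in contexts if label == "correct")
    decided = misspelled + correct
    if decided == 0:
        return None
    return 1 - correct / decided
```

For instance, with 30 instances in misspelling contexts, 10 in correct contexts, and 500 in indeterminate contexts, both functions yield 0.75; the 500 indeterminate instances do not enter the estimate.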

Occurrences of bad_word in a given context are determined to be misspellings of good_word by comparing f_bad, the number of occurrences of bad_word in the given context, with f_good, the number of occurrences of good_word in the given context. If f_bad is greater than f_good by a significant amount, the occurrences of bad_word in the given context are deemed to be correctly spelled occurrences of bad_word, and the given context is determined to contain a correct spelling of bad_word. If f_good is greater than f_bad by a significant amount, the occurrences of bad_word in the given context are deemed to be misspellings, and the given context is determined to contain a misspelling of good_word. Contexts that contain at least one occurrence of bad_word but do not satisfy the criteria for being characterized as containing a correct spelling of bad_word or a misspelling of good_word are characterized as indeterminate. In addition to the above, a variety of different heuristics are contemplated by the method of the invention.

According to one embodiment of the invention, the occurrences of a word in a context are significant if they satisfy a given threshold and the ratio of the word's frequency in the context to its frequency in the whole corpus is at least a second threshold. Values of 3 and 30 have been found useful for these thresholds.
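One possible reading of this significance test is sketched below. The text does not define "frequency" precisely, so the sketch assumes relative frequencies (occurrences divided by the size of the context or corpus); that interpretation, the function name, and the parameter names are all assumptions of this example rather than anything stated in the source.

```python
def is_significant(count_in_context, context_size,
                   count_in_corpus, corpus_size,
                   min_count=3, min_frequency_ratio=30.0):
    """Return True if a word's occurrences in a context look significant.

    Assumed reading: the word must occur at least min_count times in the
    context, and its relative frequency in the context must be at least
    min_frequency_ratio times its relative frequency in the whole corpus.
    context_size and corpus_size are assumed to be positive.
    """
    if count_in_context < min_count or count_in_corpus == 0:
        return False
    frequency_in_context = count_in_context / context_size
    frequency_in_corpus = count_in_corpus / corpus_size
    return frequency_in_context / frequency_in_corpus >= min_frequency_ratio
```

Under this reading, a word seen 5 times among 100 instances of a context, but only 50 times in a corpus of 1,000,000 words, would count as significant, whereas a word seen only twice in the context would not.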

Other significance-determination implementations are contemplated within the scope of the invention. In another exemplary implementation of the significance determination, the comparison threshold is determined dynamically according to predetermined criteria.

According to another embodiment of the present invention, a two-corpus method adds one dimension to the method described above. The ratio of the frequency (of occurrences) of good_word to the frequency of bad_word in a given context (i.e., f_good/f_bad) is first determined from a corpus (e.g., a main corpus) as described above; this ratio is called the "main corpus ratio." Another ratio of the frequency (of occurrences) of good_word to the frequency of bad_word in the given context (f_good/f_bad) is also determined from a second, better-spelled corpus; this ratio is called the "better-spelled ratio." The better-spelled corpus is similar to the main corpus in its pattern of content, but fewer words are misspelled in it than in the main corpus. The contexts are determined, as described above, from one of the corpora. Finally, the ratio of (1) the better-spelled ratio to (2) the main corpus ratio is called the "better-to-main ratio."

The main corpus ratio, the better-spelled ratio, and the better-to-main ratio in a given context are then used to determine whether bad_word is misspelled. For example, if the main corpus ratio is greater than a given threshold (a threshold of 1 works well), the better-spelled ratio is greater than a given threshold (2 works well), and the better-to-main ratio is greater than a given threshold (2 works well), then bad_word is likely misspelled in the given context. If the better-spelled ratio is less than a given threshold (1 works well), bad_word is likely spelled correctly in the given context. Other thresholds may of course be used. Likewise, other comparisons of the main corpus ratio, the better-spelled ratio, and/or the better-to-main ratio may be used.
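The threshold comparison just described might be sketched as follows, using the example thresholds from the text (1, 2, and 2 for the misspelling test, 1 for the correct-spelling test). The function names, the "undecided" outcome, and the handling of zero counts are assumptions of this sketch.

```python
import math


def ratio(numerator, denominator):
    """Divide, treating x/0 as infinity (or 0.0 for 0/0) instead of raising."""
    if denominator == 0:
        return math.inf if numerator > 0 else 0.0
    return numerator / denominator


def two_corpus_verdict(main_good, main_bad, better_good, better_bad,
                       main_min=1.0, better_min=2.0,
                       better_to_main_min=2.0, correct_max=1.0):
    """Classify bad_word in one context from counts in two corpora."""
    main_ratio = ratio(main_good, main_bad)        # f_good/f_bad, main corpus
    better_ratio = ratio(better_good, better_bad)  # f_good/f_bad, better-spelled corpus
    better_to_main = ratio(better_ratio, main_ratio)
    if better_ratio < correct_max:
        return "correct"
    if (main_ratio > main_min and better_ratio > better_min
            and better_to_main > better_to_main_min):
        return "misspelled"
    return "undecided"  # assumption: the text leaves other cases open
```

The intuition is that a true misspelling becomes rarer, relative to the reference word, in a corpus that is spelled more carefully, so the better-spelled ratio rises above the main corpus ratio.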

According to another embodiment of the invention, the two-corpus method is applied without restriction to context. For example, if the main corpus ratio is greater than a given threshold (e.g., 1), the better-spelled ratio is greater than a given threshold (e.g., 2), and the better-to-main ratio is greater than a given threshold (e.g., 2), then bad_word is likely misspelled in that context. If the better-spelled ratio is less than a given threshold (e.g., 1), or if the better-to-main ratio is greater than a given threshold (e.g., 1.5), bad_word is likely correct in the given context. Other thresholds may of course be used. Likewise, other comparisons of the main corpus ratio, the better-spelled ratio, and/or the better-to-main ratio may be used.

Related words are sometimes misidentified as misspellings in a context-based determination. According to another embodiment of the invention, such misidentification of related words is reduced. If bad_word is a universal (i.e., context-insensitive) misspelling of good_word, bad_word is expected to occur in every context in which good_word occurs frequently. Finding at least one context in which good_word occurs frequently but bad_word rarely occurs indicates that bad_word may not be a universal misspelling of good_word applicable in every case. For example, a context-based determination might conclude that "woman" is a misspelling of "women" on the basis of the frequency-of-occurrence heuristics described above in a substantial portion of the same contexts. However, observing that "What Women Want" (a well-known movie title) occurs frequently while "What Woman Want" rarely occurs shows that in some contexts "woman" is not a misspelling of "women." Thus "woman" is not a universal misspelling of "women," and this exemplary method concludes that the target text-string (i.e., "woman") is not a misspelling of the reference text-string (i.e., "women") in every text. In another implementation, at least N contexts (N being greater than 1) in which good_word occurs frequently but bad_word rarely occurs must be found to indicate that bad_word may not be a universal misspelling of good_word applicable in every situation.
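The counterexample check described above can be illustrated with a small sketch. The text does not quantify "frequently" or "rarely," so the cutoffs `frequent_min` and `rare_max`, like the function name and the sample counts in the usage note, are hypothetical.

```python
def may_be_universal_misspelling(context_counts, n_required=1,
                                 frequent_min=30, rare_max=1):
    """Return False once enough counterexample contexts are found.

    context_counts maps a context to (f_good, f_bad). A counterexample is a
    context where good_word is frequent but bad_word is rare. The cutoffs
    are illustrative assumptions, not values from the source.
    """
    counterexamples = [
        context
        for context, (f_good, f_bad) in context_counts.items()
        if f_good >= frequent_min and f_bad <= rare_max
    ]
    return len(counterexamples) < n_required
```

With invented counts such as `{"what _ want": (500, 1), "young _ said": (20, 40)}` for good_word "women" and bad_word "woman", the first context is a counterexample, so the function returns False when `n_required` is 1 and True when `n_required` is 2.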

According to another embodiment of the invention, the two-corpus, context-insensitive method described above is used in combination with one of the context-sensitive methodologies described earlier to govern the determination of misspelling likelihood. For example, the likelihood that the target text-string is a misspelling is calculated unless at least N contexts are found in which the reference text-string occurs frequently but the target text-string rarely occurs, a finding which means that the target text-string is not a universal misspelling of the reference text-string.

The use of context in embodiments of the invention to determine the probability that a target word is a misspelling (i.e., P_Misspell(bad_word, good_word)) is distinct from the general use of n-gram contextual information in traditional statistical language model (SLM) based spelling correction methods. For example, other SLM-based spelling correction methods might be able to correct "collage cheerleaders" to "college cheerleaders" based on the frequency of "college cheerleaders" in a training corpus. Such methods, however, would be unable to correct "collage" to "college" in a new context (i.e., a context other than a well-known one such as "college cheerleaders") or in isolation. The method of the invention can determine that "collage" is commonly a misspelling of "college" by determining the likelihood of a "universal" misspelling derived from a variety of contexts and by ensuring, through application of the significance determination, that the likelihood determination is accurate.

According to another embodiment of the invention, the embodiments described above are used to calculate the likelihood (e.g., probability) that a target text-string is a misspelling of a reference text-string. The probability is then used in reverse to suggest to the user, or to select, the reference text-string as a potential spelling correction for the target text-string. In one implementation, the user is presented with a ranked list of alternative reference text-strings from which a replacement for the examined target text-string can be selected.
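A ranked suggestion list of this kind could be assembled as in the sketch below. The function name and the cutoff `min_probability` are assumptions, and the probabilities in the usage note are invented purely for illustration.

```python
def rank_suggestions(probabilities, min_probability=0.5):
    """Rank candidate reference text-strings for one target text-string.

    probabilities maps each candidate reference text-string to an estimate
    of P_Misspell(target, reference). Candidates below min_probability
    (a hypothetical cutoff) are dropped; the rest are sorted so that the
    most probable correction comes first.
    """
    kept = [(reference, p) for reference, p in probabilities.items()
            if p >= min_probability]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

For a target such as "collage" with made-up estimates `{"college": 0.9, "collate": 0.2, "collagen": 0.6}`, the result is `[("college", 0.9), ("collagen", 0.6)]`; an automatic corrector would take the first entry, while an interactive one would show the whole list.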

According to another embodiment of the invention, the methods described above are implemented in a data network search engine application to check the spelling of text used to direct a search of the data network. For example, an Internet search engine includes spell-checking steps such as those described above. The search engine prompts the user for a search query, the query being a sequence of text-strings that identifies a first target web page. The text-strings of the search query are subjected to the spell check. In one implementation, the user is prompted to correct or confirm text-strings identified as likely misspelled. In another implementation, spelling errors are corrected automatically using the reference text-string of which the text-string is most likely a misspelling. The search query resulting from the spell-check process identifies a second target web page. The search engine performs a data network search to locate at least one destination web page based on the spell-corrected search query identifying the second target web page.

FIG. 1 shows an example of a computer system 100 that implements the spell-check function of the invention. A user computer 110 is connected to a data network 180 (e.g., the Internet) via a communication interface 115. Web page servers 120, 130, and 140 are each connected to the data network 180 and are adapted to serve web pages (e.g., hyperlinked documents) and other information over the data network. For example, server 120 hosts web page 125, server 130 hosts web page 135, and server 140 hosts web page 145. A browser application running on the user computer 110 facilitates retrieval of data network information (e.g., web pages).

A computer device 150 is connected to the data network. The computer 150 includes storage media 160 coupled to a processor 170. The storage media 160 stores at least one database 165, which is, for example, a spell-check corpus. In an exemplary arrangement, the computer 150 executes a data network search engine application via the processor 170. The search engine application is adapted to search the data network to find destination web pages in response to a search query. Searching the data network includes searching a web page summary database stored on the storage media, the database containing descriptive and linking information that characterizes each of the web pages 125, 135, and 145.

A user submits a search query to the search engine application running on the computer 150 via the user computer 110, the communication interface 115, and the data network 180. The search query identifies a first target web page. For example, the first target web page corresponds to web page 125. The search query may, however, contain a spelling error in its text that misidentifies the destination web page the user intended, for example when the user meant the search to retrieve web page 145. The search engine application includes, for example, a spell-check method as described above, which is used to spell-check the text of the received search query. The search query after spell checking identifies a second target web page; for example, the second target web page corresponds to the destination web page 145 intended by the user. If the text characters of the search query are changed substantially as a result of the spelling error detection process, the second target web page is likely to differ from the first target web page. Conversely, if the text characters of the search query are unchanged or changed only slightly, the second target web page may not differ from the first target web page. The search engine performs a search of the data network and/or the database in response to the spell-checked search query to determine whether the identified second target web page corresponds to at least one destination web page (e.g., web page 145). The search engine then provides the user with link and descriptive web page summary information based on the search results.

FIG. 2 shows one embodiment of a corpus of the invention. A database (e.g., a corpus) 210 is stored on storage media 200. FIG. 2 shows an excerpt of the text contained in the corpus. From the text, various contexts can be determined based on a target text-string (e.g., a word). One example of a target text-string 220 is the word "red." As shown in FIG. 2, the corpus may contain one or more occurrences of the target text-string 220. The contexts of the target text-string are determined from the occurrences of the target text-string in the corpus. For example, a first context 230 is derived from the target word following the word "little"; the context is the text-string that follows the word "little." Other instances of this context are shown in FIG. 2 at 230' (another occurrence of the target word "red" following the word "little"), 230" (the same context, although the word "little" precedes a different word, "bed"), and 230"' (the word "little" precedes the word "blue"). Alternatively, the context can be defined as a text-string within one word of the word "little," in which case, in addition to the instances indicated at 230', 230", and 230"' discussed above, it also includes the word arrangement indicated at 232, in which the word "read" 250 precedes the word "little."
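The first context definition above (the text-string that follows the word "little") can be illustrated by counting which words fill that slot in a tokenized corpus. The function name and the whitespace tokenization are assumptions of this sketch, and the sample sentence in the usage note is invented; it is not the excerpt shown in FIG. 2.

```python
from collections import Counter


def following_word_context(tokens, anchor):
    """Count the words that appear immediately after `anchor` in `tokens`."""
    return Counter(
        following
        for preceding, following in zip(tokens, tokens[1:])
        if preceding == anchor
    )
```

For example, with `tokens = "the little red wagon and the little red hen sat by the little bed near a little blue cart".split()`, calling `following_word_context(tokens, "little")` counts "red" twice and "bed" and "blue" once each. These per-context counts are the kind of f_bad and f_good values that the heuristics compare.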

A second context 225, derived from an occurrence of the target word "red" in the corpus, is defined as the word preceding the word "paint." A third context 240 can be defined as the context in which the target word is located between the words "little" and "Wagons." No other instances of the second context 225 are shown in FIG. 2, but two other instances of the third context are shown: 240' (an occurrence of the target word "red" located between "little" and "wagons") and 240" (a different word, "blue," located between "little" and "wagons"). The words "read" 250 and "blue" 252 represent reference text-strings (e.g., words). A reference text-string is another text-string occurring in a context identified from the occurrences of the target text-string. According to the method of the invention, heuristics are used to determine whether the target text-string is a misspelling of a reference text-string identified from the contexts of the target text-string.

According to the present invention, spelling errors in search queries can be checked more efficiently and accurately, enabling efficient retrieval of network data.

The invention is not limited to the specific examples described above but is intended to cover all aspects of the invention as set forth in the claims. For example, although a method of spell-checking the words of a search query in a data network search engine application has been described, other techniques for spell-checking text-strings in computer-implemented applications may benefit from the foregoing. Various modifications and equivalent processes, as well as numerous structures to which the invention may be applicable, will be readily apparent to those skilled in the art upon review of this specification. The claims are intended to cover such modifications and devices.

FIG. 1 shows a system block diagram of an exemplary embodiment of a data network arrangement in accordance with the present invention.

FIG. 2 shows an exemplary embodiment of a corpus database in accordance with an exemplary embodiment of the present invention.

<Explanation of reference numerals for the main parts of the drawings>

100: computer system
110: user computer
115: communication interface
120, 130, 140: web page servers
150: computer
160, 200: storage media
170: processor
210: database

Claims

A computer-implemented method for detecting spelling errors in a target text-string, the method comprising:

comparing the target text-string with a database of contexts and, from the comparison, determining a set of contexts that includes

(1) X contexts, each characterized as containing a correct spelling of the target text-string,

(2) Y contexts, each characterized as containing an incorrect spelling of a reference text-string, and

(3) Z contexts, each characterized as an indeterminate context; and

calculating a likelihood that the target text-string is a misspelling of the reference text-string as a function of one of X and Y relative to the sum of X and Y.
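As an illustration of the calculation recited above, one plausible reading is the fraction Y / (X + Y), with the Z indeterminate contexts left out. The claim only requires "a function of one of X and Y" relative to the sum, so the exact formula, the function name, and the handling of the empty case below are assumptions.

```python
def misspelling_likelihood(x, y):
    """Likelihood that the target is a misspelling of the reference.

    x: number of contexts characterized as correct spellings of the target
    y: number of contexts characterized as incorrect spellings of the reference
    Assumed formula: y / (x + y). Indeterminate contexts (Z) are ignored.
    """
    total = x + y
    if total == 0:
        # No decisive contexts at all; the claims do not say what to do here,
        # so this sketch reports "unknown".
        return None
    return y / total

print(misspelling_likelihood(1, 3))
```

A caller would then compare the returned value against an application-specific cutoff before suggesting the reference text-string as a correction.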

The method of claim 1,

Wherein each of X, Y, and Z is a positive integer.

The method of claim 1,

wherein the calculation does not involve Z.

The method of claim 1,

further comprising applying heuristics to characterize the contexts of the database,

wherein the heuristics are a function of occurrences of the target text-string in the context and occurrences of the reference text-string in the context.

The method of claim 1,

further comprising characterizing the context as including the incorrect spelling of the reference text-string in response to occurrences of the reference text-string in the context being at least equal to a predetermined minimum quantity threshold and a ratio of occurrences of the reference text-string in the context, relative to occurrences of the target text-string in the context, being greater than a predetermined ratio threshold.

The method of claim 5,

wherein no context simultaneously has occurrences of the reference text-string equal to at least a predetermined maximum quantity threshold and occurrences of the target text-string equal to at most a predetermined minimum quantity threshold.

The method of claim 6,

The predetermined maximum quantity threshold is three and the predetermined minimum quantity threshold is one.

The method of claim 5,

further comprising characterizing the context as including the correct spelling of the target text-string in response to occurrences of the target text-string in the context being at least equal to a second predetermined quantity threshold and a ratio of occurrences of the target text-string in the context, relative to occurrences of the reference text-string in the context, being greater than a second predetermined ratio threshold.

The method of claim 8,

wherein the second predetermined ratio threshold is equal to at least one.

The method of claim 9,

wherein the second predetermined quantity threshold is three and the second predetermined ratio threshold is one.

The method of claim 8,

further comprising characterizing a context having an occurrence of the target text-string as indeterminate in response to the context being characterized neither as including the correct spelling of the target text-string nor as including the incorrect spelling of the reference text-string.

The method of claim 1,

further comprising characterizing the context as including the correct spelling of the target text-string in response to occurrences of the target text-string in the context being at least equal to a predetermined quantity threshold and a ratio of occurrences of the target text-string in the context, relative to occurrences of the reference text-string in the context, being greater than a predetermined ratio threshold.

The method of claim 5 or 12,

wherein the predetermined ratio threshold is equal to at least one.

The method of claim 13,

The predetermined quantity threshold is three and the predetermined ratio threshold is one.
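The threshold tests recited in the preceding claims can be sketched as a three-way classifier. This is a hedged illustration only: the function name and return labels are hypothetical, the same quantity threshold (three) and ratio threshold (one) are assumed for both tests, and the ratio comparison is written as a multiplication so that a zero count cannot cause a division error.

```python
def characterize_context(target_count, reference_count,
                         quantity_threshold=3, ratio_threshold=1.0):
    """Classify one context from occurrence counts (illustrative sketch).

    Returns "incorrect" (context holds a misspelling of the reference),
    "correct" (context holds a correct use of the target),
    or "indeterminate" when neither test passes.
    """
    # Reference is frequent enough and outnumbers the target by the ratio.
    if (reference_count >= quantity_threshold
            and reference_count > ratio_threshold * target_count):
        return "incorrect"
    # Target is frequent enough and outnumbers the reference by the ratio.
    if (target_count >= quantity_threshold
            and target_count > ratio_threshold * reference_count):
        return "correct"
    return "indeterminate"

print(characterize_context(target_count=1, reference_count=5))
print(characterize_context(target_count=5, reference_count=1))
print(characterize_context(target_count=2, reference_count=2))
```

Counting the contexts that fall into each label yields the quantities X ("correct"), Y ("incorrect"), and Z ("indeterminate") used in the likelihood calculation.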

The method of claim 1,

The target text-string is a word, and the reference text-string is a word.

The method of claim 1,

Wherein the target text-string is a phrase, and wherein the reference text-string is a phrase.

The method of claim 1,

further comprising receiving a search query,

wherein the search query comprises the target text-string.

The method according to claim 1 or 17,

wherein the database is a corpus comprising naturally occurring text that is similar to search queries in its patterns of content and misspelling.

A computer-implemented method comprising:

selecting a reference text-string having a characteristic corresponding to a target text-string;

calculating, from a first database, a first ratio of occurrences of the reference text-string to occurrences of the target text-string;

calculating, from a second database, a second ratio of occurrences of the reference text-string to occurrences of the target text-string; and

determining a likelihood that the target text-string is misspelled as a function of the first ratio and the second ratio,

wherein the first database and the second database are each a corpus comprising naturally occurring text similar in pattern of content, and the second database is a better-spelled corpus than the first database.

The method of claim 19,

wherein the characteristic corresponding to the target text-string includes an edit distance between the target text-string and the reference text-string, and an edit distance between a previously identified misspelling of the reference text-string and the target text-string.
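The claims do not say which edit distance is meant. As a sketch, assuming the common Levenshtein distance (single-character insertions, deletions, and substitutions, each costing one), the characteristic could be computed like this; the function name is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (assumed metric)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

print(edit_distance("red", "read"))
```

A small distance between the target and a candidate, such as the single insertion separating "red" from "read", is what would make that candidate worth considering as a reference text-string.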

The method of claim 19,

wherein the target text-string is determined to be misspelled if the second ratio is greater than a first comparison threshold and the ratio of the second ratio to the first ratio is greater than a second comparison threshold.

The method of claim 21,

wherein the first comparison threshold is two and the second comparison threshold is two.
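The two-corpus test can be sketched as below, using the threshold value of two recited for both comparisons. The function names and the example counts are invented for illustration. Each ratio is reference occurrences divided by target occurrences, and the "ratio of the second ratio to the first ratio" is checked by multiplication so that a first ratio of zero does not raise an error.

```python
def occurrence_ratio(reference_count, target_count):
    """Occurrences of the reference per occurrence of the target.
    The caller must ensure target_count > 0 (not addressed in the claims)."""
    return reference_count / target_count

def is_misspelled(first_ratio, second_ratio,
                  first_threshold=2.0, second_threshold=2.0):
    """True if the two-corpus test flags the target as misspelled."""
    return (second_ratio > first_threshold
            and second_ratio > second_threshold * first_ratio)

# Hypothetical counts: the target is much rarer, relative to the reference,
# in the better-spelled second corpus than in the first corpus.
first = occurrence_ratio(reference_count=1000, target_count=200)   # 5.0
second = occurrence_ratio(reference_count=900, target_count=30)    # 30.0
print(is_misspelled(first, second))
```

The intuition is that a misspelling becomes disproportionately rare in the better-spelled corpus, so the reference-to-target ratio grows sharply from the first database to the second.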

The method of claim 19,

wherein the characteristic corresponding to the target text-string includes an edit distance between the target text-string and the reference text-string, and an edit distance between a previously identified misspelling of the reference text-string and the target text-string, and

wherein the target text-string is determined to be misspelled if the second ratio is greater than a first comparison threshold and the ratio of the second ratio to the first ratio is greater than a second comparison threshold.

A computer-implemented method comprising:

comparing a target text-string with a first database and a second database of contexts and, from the comparison, determining a set of contexts that includes

(1) X contexts, each characterized as containing a correct spelling of the target text-string,

(2) Y contexts, each characterized as containing an incorrect spelling of a reference text-string, and

(3) Z contexts, each characterized as an indeterminate context; and

calculating a likelihood that the target text-string is a misspelling of the reference text-string as a function of one of X and Y relative to the sum of X and Y.

The method of claim 24,

further comprising characterizing the context as including the incorrect spelling of the reference text-string, or as including the correct spelling of the target text-string, in response to:

(1) a first ratio of occurrences of the reference text-string in the context to occurrences of the target text-string in the context, the first ratio being determined from the first database;

(2) a second ratio of occurrences of the reference text-string in the context to occurrences of the target text-string in the context, the second ratio being determined from the second database; and

(3) a third ratio of the second ratio to the first ratio,

wherein the first database and the second database are each a corpus comprising naturally occurring text similar in pattern of content, and the second database is a better-spelled corpus than the first database.

The method of claim 25,

wherein the context is characterized as including the incorrect spelling if the first ratio is greater than a first predetermined comparison threshold, the second ratio is greater than a second predetermined comparison threshold, and the third ratio is greater than a third predetermined comparison threshold.

The method of claim 26,

wherein the first, second, and third comparison thresholds are 1, 2, and 2, respectively.

The method of claim 25,

wherein, if the second ratio is less than a predetermined comparison threshold, the context is characterized as including the correct spelling of the target text-string.

The method of claim 28,

wherein the predetermined comparison threshold is one.
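The per-context tests of the preceding claims, with thresholds 1, 2, and 2 for the incorrect-spelling test and 1 for the correct-spelling test, can be combined into one sketch. The function name, the labels, and the "indeterminate" fallback for contexts that pass neither test are assumptions. The third ratio (second ratio divided by first ratio) is compared by multiplication to avoid dividing by zero.

```python
def characterize_with_two_corpora(first_ratio, second_ratio,
                                  t1=1.0, t2=2.0, t3=2.0,
                                  correct_threshold=1.0):
    """Classify a context from its reference/target occurrence ratios.

    first_ratio:  ratio measured in the first corpus
    second_ratio: ratio measured in the better-spelled second corpus
    """
    # Equivalent to (second_ratio / first_ratio) > t3 for first_ratio > 0.
    third_ratio_passes = second_ratio > t3 * first_ratio
    if first_ratio > t1 and second_ratio > t2 and third_ratio_passes:
        return "incorrect"
    if second_ratio < correct_threshold:
        return "correct"
    # Fallback label assumed for this sketch.
    return "indeterminate"

print(characterize_with_two_corpora(1.5, 4.0))
print(characterize_with_two_corpora(0.2, 0.5))
print(characterize_with_two_corpora(3.0, 4.0))
```

The two tests cannot both pass for the same context with these defaults, because the second ratio cannot be greater than 2 and less than 1 at once.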

The method of claim 19 or 25,

further comprising receiving a search query,

wherein the search query comprises the target text-string.

A computer-implemented method of controlling web page information in response to a user query identifying a first target web page, the method comprising:

checking the spelling of each text-string in the user query to identify a second target web page in response to the user query;

searching a database to determine whether the second target web page corresponds to at least one destination web page; and

in response to the second target web page corresponding to the at least one destination web page, providing link information that allows a user to access the at least one destination web page, together with ancillary information relating to determining whether to link to the at least one destination web page.