WO2015139497A1

WO2015139497A1 - Method and apparatus for determining similar characters in search engine

Info

Publication number: WO2015139497A1
Application number: PCT/CN2014/094933
Authority: WO
Inventors: 项碧波
Original assignee: 北京奇虎科技有限公司; 奇智软件（北京）有限公司
Priority date: 2014-03-19
Filing date: 2014-12-25
Publication date: 2015-09-24

Abstract

Disclosed are a method and apparatus for determining similar characters in a search engine. The method comprises: determining a first character and a second character that are input into a search engine and are to be checked; acquiring a first code character string of the first character and a second code character string of the second character according to a preset rule; calculating a code distance between the first code character string and the second code character string; when the code distance is less than a preset distance threshold, determining that the first character and the second character are similar characters; and establishing a similar character mapping between the first character and the second character in the search engine. The embodiments determine whether a first character and a second character are similar characters, thereby improving the webpage recognition efficiency of the search engine and providing the search engine with more functions.

Description

Method and apparatus for determining near-words (similar-shaped characters) in a search engine

Technical field

The invention relates to the technical field of language text information, and in particular to a method for determining near-words in a search engine, a method for providing error correction of Chinese keywords in a search, an instant search method, an apparatus for determining near-words in a search engine, an apparatus for providing error correction of Chinese keywords in a search, and an instant search system.

Background art

With the rapid development of the Internet, network applications tend to be diversified, and the amount of information on the Internet has increased dramatically.

In many situations, users need to input text for information interaction, for example, entering a keyword in a search engine to search for web page information, or entering words and phrases in an instant messaging tool to communicate with other users.

Written languages contain near-words, that is, characters with similar structures. Various encoding methods have been defined for inputting characters, such as Wubi (five-stroke) encoding and pinyin encoding. When a user inputs text with such an encoding method, the existence of near-words makes it easy to mistype and input a different character. As a result, users often need to re-enter the text, which is not only troublesome but also wastes system resources.

Taking Wubi as an example, the accuracy of text entered with the five-stroke method depends on how careful the user is and how well the user knows the Chinese character itself. Wrongly typed characters caused by carelessness or by the user's own misconception of a character are not uncommon. For example, a newspaper headline, "The screaming horn is being punished and not shouting," was written as "a slap in the face and not being shouted."

Furthermore, suppose the user wants to input the search term "项羽" (Xiang Yu) in the search engine to search for web page information about the historical figure Xiang Yu, but mistypes "项" as "顶". Since "项" and "顶" are near-words, the user is likely to enter "顶羽" without noticing and directly request the search engine to search for web page information related to "顶羽".

On the one hand, the search results of such a mistaken operation differ greatly from what the user expected, the user experience is poor, and the resources of both the client and the search engine are wasted. On the other hand, to obtain the web page information of interest, the user has to input the keyword again, and the search engine must again search, compare, and filter massive amounts of information to obtain the information related to the search keyword. This makes user operations more cumbersome and time consuming, greatly increases the burden on the search engine, and consumes more resources of the client and the search engine.

Summary of the invention

In view of the above problems, the present invention is proposed in order to provide a method for determining near-words in a search engine, a method for providing error correction of Chinese keywords in a search, an instant search method, and a corresponding apparatus for determining near-words in a search engine, apparatus for providing error correction of Chinese keywords in a search, and instant search system, which overcome the above problems or at least partially solve them.

According to an aspect of the present invention, a method for determining a near-word in a search engine is provided, including:

Determining a first character and a second character that are input into the search engine and are to be verified;

Acquiring a first encoded character string of the first character and a second encoded character string of the second character according to a preset rule;

Calculating an encoding distance between the first encoded character string and the second encoded character string;

Determining, when the encoding distance is less than a preset distance threshold, that the first character and the second character are near-words of each other;

Establishing, in the search engine, a near-word mapping relationship between the first character and the second character.

According to another aspect of the present invention, a method for providing keyword error correction in a search is provided, comprising:

Receiving a search request, the search request including a search keyword;

When an error is found during error correction processing of the search keyword, rewriting the search keyword by using a near-word matching the search keyword;

Performing a search with the rewritten search keyword to obtain search result data matching the rewritten search keyword.

According to another aspect of the present invention, an instant search method is provided, comprising:

Detecting the text information currently input in the search bar, performing error correction processing on the currently input text information, and providing instant search result data as feedback based on the currently input text information;

When an error is found during the error correction processing of the text information, calculating an approximate character matching the character data contained in the erroneous text information;

Inserting, in the instant search result data, prompt information that recommends the approximate character for correcting the erroneous text information;

When a trigger indication from the user on the prompt information is received, providing the instant search result data obtained by searching with the approximate character corresponding to the trigger indication.

According to another aspect of the present invention, an apparatus for determining a near-word in a search engine is provided, comprising:

a text determining module, configured to determine a first character and a second character that are input into the search engine and are to be verified;

a code obtaining module, configured to acquire a first encoded character string of the first character and a second encoded character string of the second character according to a preset rule;

a coding distance calculation module, configured to calculate an encoding distance between the first encoded character string and the second encoded character string;

a near-word determining module, configured to determine that the first character and the second character are near-words of each other when the encoding distance is less than a preset distance threshold;

a mapping relationship determining module, configured to establish, in the search engine, a near-word mapping relationship between the first character and the second character.

According to another aspect of the present invention, an apparatus for providing keyword error correction in a search is provided, including:

a receiving unit, configured to receive a search request, the search request including a search keyword;

a rewriting unit, configured to rewrite the search keyword by using a near-word matching the search keyword when an error is found during error correction processing of the search keyword;

a search unit, configured to perform a search with the rewritten search keyword to obtain search result data matching the rewritten search keyword.

According to another aspect of the present invention, an instant search system is provided, comprising:

a text information detecting unit, configured to detect text information currently input in the search bar;

The error correction processing unit is adapted to perform error correction processing on the currently input text information;

a first result providing unit, configured to provide real-time search result data based on the currently input text information feedback;

an approximate word calculation unit, configured to calculate, when an error is found during the error correction processing of the text information, an approximate character matching the character data contained in the erroneous text information;

an error correction prompting unit, configured to insert, in the instant search result data, prompt information that recommends the approximate character for correcting the erroneous text information;

a second result providing unit, configured to provide, when a trigger indication from the user on the prompt information is received, the instant search result data obtained by searching with the approximate character corresponding to the trigger indication.

According to still another aspect of the present invention, a computer program is provided, comprising computer readable code that, when executed on a computing device, causes the computing device to perform the method for determining near-words in a search engine, the method for providing keyword error correction in a search, and the instant search method described above.

According to still another aspect of the present invention, a computer readable medium is provided, wherein the computer program described above is stored.

The beneficial effects of the invention are:

In the embodiments of the present invention, whether the first character and the second character are near-words of each other is determined by calculating the encoding distance between the first encoded character string of the first character and the second encoded character string of the second character in the search engine. This improves the efficiency of web page recognition by the search engine and adds to the functions of the search engine.

In the embodiments of the present invention, error correction processing is performed on the search keyword, and the search keyword is rewritten by using a near-word matching the search keyword, so as to obtain search result data matching the rewritten search keyword. On the one hand, the rewritten search keyword brings the search results closer to what the user originally expected, which improves the user experience, reduces the waste of client and search engine resources, and improves search efficiency. On the other hand, the user no longer has to input a keyword again to obtain the web page information of interest, so the search engine does not have to search, compare, and filter massive amounts of information a second time to obtain the information related to the search keyword. User operation is more convenient, the time the user spends is reduced, and the resource consumption of the client and the search engine is further reduced.

In the embodiments of the present invention, error correction processing is performed on the text information in the instant search engine, and the text information is rewritten by using an approximate character matching the text information, so as to obtain search result data matching the rewritten text information. On the one hand, the rewritten search keyword brings the search results closer to what the user originally expected, which improves the user experience, reduces the waste of client and search engine resources, and improves search efficiency. On the other hand, the user no longer has to input a keyword again to obtain the web page information of interest, so the search engine does not have to search, compare, and filter massive amounts of information a second time to obtain the information related to the search keyword. User operation is more convenient, the time the user spends is reduced, and the resource consumption of the client and the search engine is further reduced.

The above description is only an overview of the technical solutions of the present invention, and the above-described and other objects, features and advantages of the present invention can be more clearly understood. Specific embodiments of the invention are set forth below.

DRAWINGS

Various other advantages and benefits will become apparent to those skilled in the art from the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be construed as limiting the invention. Throughout the drawings, the same reference numerals are used to refer to the same parts. In the drawings:

FIG. 1 is a flow chart showing the steps of an embodiment of a method for determining a near-word in a search engine according to an embodiment of the present invention;

FIG. 2 is a flow chart showing the steps of an embodiment of a method for providing keyword error correction in a search according to an embodiment of the present invention;

FIG. 3 is a flow chart showing the steps of an embodiment of an instant search method according to an embodiment of the present invention;

FIG. 4 is a block diagram schematically showing an embodiment of an apparatus for determining a near-word in a search engine according to an embodiment of the present invention;

FIG. 5 is a block diagram schematically showing an embodiment of an apparatus for providing error correction of keywords in a search according to an embodiment of the present invention;

FIG. 6 is a block diagram showing the structure of an embodiment of an instant search system according to an embodiment of the present invention;

FIG. 7 schematically shows a block diagram of a computing device for performing the method according to the invention;

FIG. 8 schematically shows a storage unit for holding or carrying program code implementing the method according to the invention.

Detailed description

The invention is further described below in conjunction with the drawings and specific embodiments.

Referring to FIG. 1, a flow chart of the steps of an embodiment of a method for determining a near-word in a search engine according to one embodiment of the present invention is shown, which may include the following steps:

Step 101: Determine a first character and a second character that are input into the search engine and are to be verified;

The processing flow of the search engine can generally be divided into two parts, the first part is the front-end user request, and the second part is the back-end production data.

First, the front-end user request processing process can include:

1. The user enters a keyword;

2. Query word analysis, search engine segmentation of keywords;

3. Search, according to the result of the word segmentation, find out the relevant webpage collection from the index created in advance;

4. Sorting, sorting candidate webpages according to dimensions such as content relevance and timeliness;

5. Presentation: Display the sorted web pages.

Second, the backend production data process can include:

1. Web crawling, the crawler crawls the webpage of the Internet and saves it through the link relationship between the webpages;

2. Index production, analyze the crawled and saved webpages, segment the page title and page text, and make an inverted index based on the word segmentation results for front-end retrieval.
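The index production and retrieval steps above can be illustrated with a toy inverted index. This is a minimal sketch rather than the implementation of the invention: the function names `build_inverted_index` and `search`, the whitespace tokenizer, and the sample pages are all assumptions made for illustration. A real engine would use a Chinese word segmenter and would also rank the candidate pages.

```python
from collections import defaultdict


def build_inverted_index(pages, tokenize):
    """Map each term to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in tokenize(text):
            index[term].add(page_id)
    return index


def search(index, query, tokenize):
    """Return the ids of pages that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical crawled pages; whitespace splitting stands in for segmentation.
pages = {1: "xiang yu history", 2: "yu garden guide"}
index = build_inverted_index(pages, str.split)
hits = search(index, "xiang yu", str.split)
```

Sorting and presentation (steps 4 and 5 of the front-end flow) are deliberately left out of the sketch.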

The webpage crawled by the crawler can be saved in the webpage database, and the webpage stores a lot of text information, and the webpage database can also be called a corpus.

In a specific implementation, the first character and the second character may be extracted from the corpus in order to verify whether they are near-words of each other.

In an optional example of the embodiment of the present invention, the first character and the second character may be Chinese characters.

Step 102: Acquire a first encoded character string of the first character and a second encoded character string of the second character according to a preset rule.

Characters have specific structural features. Encoding characters according to these features establishes an input mode, so that characters can be input into an electronic device. For example, the first character and the second character can be input using a pinyin input mode, a Wubi (five-stroke) input mode, a stroke input mode, and the like.

Correspondingly, under different encoding rules the first character and the second character may correspond to different first and second encoded character strings. For example, the encoded string of "侧" (side) is "ce" in the pinyin input mode and "wmjh" in the Wubi input mode.

In a preferred embodiment of the present invention, the preset rule may include a preset encoding rule, and step 102 may include the following sub-steps:

Sub-step S11, calculating a first encoded character string corresponding to the first character according to a preset encoding rule;

Sub-step S12, calculating a second encoded character string corresponding to the second character according to the encoding rule;

The preset encoding rule may include a five-stroke encoding rule.
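Sub-steps S11 and S12 amount to looking up each character under the chosen encoding rule. The sketch below assumes a tiny hand-written table, `CODE_TABLE`, filled only with codes quoted in this description ("ce" and "wmjh" for the character glossed as "side", taken here to be "侧", and "imjh" for the character glossed as "measure", taken here to be "测"). The table layout and the function name `get_encoded_string` are hypothetical; a real system would load a complete Wubi or pinyin code table.

```python
# Hypothetical code table: character -> {encoding rule -> encoded string}.
CODE_TABLE = {
    "侧": {"pinyin": "ce", "wubi": "wmjh"},
    "测": {"wubi": "imjh"},
}


def get_encoded_string(char, rule="wubi"):
    """Return the encoded string of `char` under `rule`, or None if unknown."""
    return CODE_TABLE.get(char, {}).get(rule)


first_code = get_encoded_string("测")   # sub-step S11
second_code = get_encoded_string("侧")  # sub-step S12
```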

Chinese characters are composed of strokes or radicals. In order to input these Chinese characters, Chinese characters can be broken into some of the most commonly used basic units, namely the root. The root can be the radical part of the Chinese character, or it can be part of the radical, or even a stroke.

When roots form Chinese characters, the characters can be divided into four types according to the positional relationship between the roots: single, scattered, connected, and intersecting. Single means that the root itself is a Chinese character, including key-name roots and character roots, such as "口" (mouth) and "木" (wood). Scattered means that the roots constituting the character keep a certain distance from each other, such as "汉" (Han) and "湘" (Xiang). Connected means that a basic root is joined to a single stroke, for example "丿" joined to "目" forms "自" (self). Intersecting means that several roots cross each other to form a character, for example "申" is formed by "日" crossing "丨".

Wubi is the abbreviation of Wubi input method, which is a shape code input method. The root is the basic unit of the five-stroke input method. The Chinese characters are encoded according to the strokes and glyph features, the roots are classified according to certain rules, and these roots are distributed on the keyboard as the basic unit for inputting Chinese characters.

Specifically, Wubi classifies the strokes of Chinese characters into five types, horizontal (including the rising stroke), vertical, left-falling (撇), right-falling (捺, including the dot), and turning (折), which correspond to five zones. The roots or symbols are distributed in a certain pattern over 25 letter keys (that is, a standard QWERTY keyboard excluding the Z key).

When inputting Chinese characters with the Wubi input method, the keys corresponding to the roots are pressed in sequence according to the writing order and structure of the character to form an encoded character string, and the system retrieves the desired character from the character library of the Wubi input method according to the input encoded character string.

It should be noted that in the Wubi input method, although the use of identification codes keeps the rate of duplicate codes (identical encoded strings) low for single characters, the duplicate-code rate for phrases is high. Therefore, the Wubi input method generally does not use a large phrase vocabulary, in order to prevent excessive duplicate codes. Conversely, the Wubi input method is especially suitable for single-character input, where it achieves higher input efficiency.

Step 103: Calculate an encoding distance between the first encoded string and the second encoded string.

By calculating the encoding distance between the first encoded string and the second encoded string, the similarity between the first encoded string and the second encoded string can be identified.

In a preferred example of an embodiment of the present invention, the encoding distance may be an edit distance. The edit distance, also known as the Levenshtein distance, is the minimum number of edit operations required to convert one string into another, here the first encoded string into the second encoded string.

The permitted edit operations are replacing one character with another, inserting a character, and deleting a character.

For example, converting the string "kitten" to the string "sitting" requires a minimum of three operations:

1. sitten (k → s): the character "k" is replaced by the character "s";

2. sittin (e → i): the character "e" is replaced by the character "i";

3. sitting (→ g): the character "g" is inserted at the end of the string "sittin".
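The edit distance used in step 103 can be computed with the standard dynamic-programming recurrence. The following is a minimal sketch; the function name `edit_distance` is an assumption, and the patent does not prescribe any particular implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    substitutions, insertions and deletions that turn `a` into `b`."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep it
            ))
        prev = curr
    return prev[-1]


distance = edit_distance("kitten", "sitting")  # the three operations listed above
```

Keeping only two rows of the table makes the memory use proportional to the length of the second string, which is negligible for Wubi codes of at most four letters.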

Step 104: When the encoding distance is less than a preset distance threshold, determine that the first character and the second character are near-words of each other.

Near-words are characters with similar glyph structures that are easily confused in use. For example, "己" (self), "已" (already), and "巳" are near-words of each other.

In the Wubi input method, roots that are the same as or similar in shape to a given stroke or radical are generally grouped in blocks and concentrated on one key or on adjacent keys. For example, in one version of the Wubi input method, the roots on the H key include "目", "上", "卜", "止", and the tiger-head component "虍".

Since the glyph structures of the near-words are similar, correspondingly, the radicals constituting the near-words are similar.

When using the Wubi input method to input a single character, except for a few key-name roots and character roots, in most cases the character must be split into roots according to the splitting rules and the features of Chinese characters. If the character splits into more than four roots, the first, second, third, and last roots are entered.
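The rule for characters that split into more than four roots can be written down directly. In this sketch the function name `select_roots` is hypothetical, and characters with four or fewer roots are returned unchanged, which ignores the identification codes that Wubi adds in those cases.

```python
def select_roots(roots):
    """Pick the roots that are actually typed for one character.

    With more than four roots, only the first, second, third and last
    root are entered; otherwise all roots are kept as they are.
    """
    if len(roots) > 4:
        return roots[:3] + roots[-1:]
    return list(roots)


# Placeholder root names, used only to show which positions are kept.
typed = select_roots(["r1", "r2", "r3", "r4", "r5", "r6"])
```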

For example, the splitting rules may include: follow the writing order, prefer the largest root, take intuitiveness into account, prefer connected over intersecting, and prefer scattered over connected.

The strokes and radicals that make up characters follow certain rules of use, which can include position rules, writing-order rules, and so on. For example, the single-person radical "亻" and the double-person radical "彳" generally sit on the far left of a character and are written first, as in the characters for "you", "100 million", "very", "to", and so on.

These rules of use allow Chinese characters to be divided into single-component characters (composed directly of strokes or of a single radical, such as the characters for top, bottom, day, and month) and compound characters (composed of several radicals, such as the characters for hanging, rest, take, and Ming).

Specifically, the Chinese character structure can be divided into:

(1) Upper and lower structure: thinking, swearing, taking risks, meaning, safety, and total;

(2) Upper, middle and lower structures: grass, violence, intention, actuality, competition;

(3) Left and right structure: good, shed, and, bee, beach, to, and Ming;

(4) Left-middle-right structure: Xie, tree, inverted, moving, smashing, whip, and arguing;

(5) Fully enclosed structure: encirclement, prisoner, sleepy, Tian, cause, country, solid;

(6) Semi-enclosed structure: package, district, flash, this, sentence, letter, wind;

(7) Interspersed structure: 噩, 兆, 非;

(8) 品-shaped structure: product, Sen, Nie, Jing, Lei, Xin, Yi.

Therefore, in the Wubi input method, because the strokes and radicals of Chinese characters resemble the Wubi roots, and the structure and writing rules of Chinese characters resemble the Wubi splitting rules, splitting near-words into their roots yields identical or similar encoded strings. For example, "测" (measure) and "侧" (side) are near-words of each other. "测" consists of three radicals, which are also roots, namely "氵", "贝", and "刂", and its encoded string is "imjh". "侧" consists of three radicals, which are also roots, namely "亻", "贝", and "刂", and its encoded string is "wmjh". Clearly, "imjh" and "wmjh" are very similar.

Correspondingly, the encoding distance is calculated between the first encoded string and the second encoded string of the first character and the second character. When the distance is less than the preset distance threshold, the similarity is high and the two characters can be regarded as near-words. Conversely, when the encoding distance is greater than or equal to the preset distance threshold, the similarity is low and the two characters can be regarded as not being near-words.

For example, in the Wubi input method, since the encoded string of a Chinese character has at most 4 letters, the distance threshold can be preset to 2. Applying the Wubi encoding rule to "候" (wait) and "侯" (hou), the encoded string of "候" is "whnd" and the encoded string of "侯" is "wntd". The encoding distance between "whnd" and "wntd" is 1, which is less than the distance threshold 2, so "候" and "侯" can be determined to be near-words of each other.
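The threshold test of step 104 can be sketched as follows, assuming the encoding distance is a plain Levenshtein distance and the threshold is 2 as in the example above. The names `edit_distance`, `are_near_words` and `DISTANCE_THRESHOLD` are illustrative and do not come from the source.

```python
DISTANCE_THRESHOLD = 2  # preset threshold from the Wubi example


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance between two encoded strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def are_near_words(code_a: str, code_b: str,
                   threshold: int = DISTANCE_THRESHOLD) -> bool:
    """Near-words are decided by: encoding distance < preset threshold."""
    return edit_distance(code_a, code_b) < threshold


# Codes quoted in the description for the "measure" / "side" pair.
result = are_near_words("imjh", "wmjh")
```

With this implementation "imjh" and "wmjh" are at distance 1 and pass the test. Note that a plain Levenshtein distance returns 2 for "whnd" and "wntd" (two substitutions), so the value of 1 quoted in the example above presumably rests on a counting convention that the text does not spell out.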

Step 105: Establish a near-word mapping relationship between the first text and the second text in the search engine.

In a specific implementation, the font database may be separately established in the search engine to collect the near-words of the current text and the corresponding near-word mapping relationship.

It should be noted that the near-word mapping relationship may be mutual. For example, the near-word mapping relationship from the first character to the second character may be recorded as first character--second character, and the near-word mapping relationship from the second character to the first character may be recorded as second character--first character.

In a preferred embodiment of the present invention, the following steps may also be included:

Step 106: Output the first character and the second character that are near-words of each other, together with their near-word mapping relationship, to a specified font database.

By applying the embodiment of the present invention, all the characters in the corpus can be traversed, the near-words of the current character can be found, and the found near-words and near-word mapping relationships can be used to generate a font database for the current character.

For example, the font database of the first character stores one or more near-words and near-word mapping relationships, such as first character--second character, third character, and fourth character; and the font database of the second character stores one or more near-words and near-word mapping relationships, such as second character--first character, fifth character, and sixth character.
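Steps 105 and 106 can be pictured as filling a dictionary that maps each character to the set of its near-words, with every relationship stored in both directions. The sketch below is only illustrative: `build_font_database` is a hypothetical name, the pairwise traversal is the simplest possible strategy, and the predicate `demo_is_near_word` is a stand-in that looks up hard-coded pairs instead of computing an encoding distance.

```python
from itertools import combinations


def build_font_database(chars, is_near_word):
    """Return {character: set of its near-words}, stored mutually."""
    database = {ch: set() for ch in chars}
    for a, b in combinations(chars, 2):
        if is_near_word(a, b):
            database[a].add(b)  # first character -- second character
            database[b].add(a)  # second character -- first character
    return database


# Stand-in predicate for the demo; a real one would compare encoded strings.
KNOWN_PAIRS = {frozenset(("测", "侧")), frozenset(("候", "侯"))}


def demo_is_near_word(a, b):
    return frozenset((a, b)) in KNOWN_PAIRS


font_database = build_font_database(["测", "侧", "候", "侯"], demo_is_near_word)
```

Comparing every pair costs quadratic time in the number of characters, which is acceptable for a one-off offline build but would need pruning for a very large character set.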

Referring to FIG. 2, a flow chart of steps of an embodiment of a method for providing keyword error correction in a search according to an embodiment of the present invention is shown, which may include the following steps:

Step 201: Receive a search request, where the search request includes a search keyword;

The search request may refer to an instruction issued by the user to search using a certain search keyword. For example, a user may issue a search request through a search engine web page, through a search plugin, and the like. When the user enters a search keyword in the search box of the search engine and clicks the search button or presses the enter key, this is equivalent to a search request having been received; likewise, when the user enters a search keyword in the input box of the search plugin and clicks the button or presses the enter key, this is equivalent to receiving the search request.

Step 202: When an error is found during error correction processing of the search keyword, rewrite the search keyword by using a near-word matching the search keyword;

In a specific implementation, the search keyword may be error-corrected using Natural Language Processing (NLP).

Error correction processing can generally be split into two subtasks:

1. Spelling error detection: according to the type of error, errors can be divided into non-word errors and real-word errors. A non-word error means that the misspelled word is not itself a legal word, such as "giraffe" mistakenly written as "graffe". A real-word error means that the misspelled word is still a legal word, such as "there" mistakenly spelled as "three" (similar form), "peace" mistakenly spelled as "piece" (homophone), and "two" mistakenly spelled as "too" (homophone). In a specific implementation, spelling correction may be performed based on a noisy channel model or the like;

2. Spelling error correction: to correct the search keyword, adjacent units can be checked against each other, such as adjacent character and character, adjacent character and word, and adjacent word and word, and the search keyword is rewritten by looking up, through the near-word mapping relationship in the font database, the near-word that best matches the character at the error position.
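The rewriting in step 202 can be sketched as a lookup in the near-word mapping followed by a check against known terms. Everything named here is an assumption for illustration: `rewrite_keyword`, the `lexicon` of valid terms used as a crude error detector, and the sample mapping, which borrows the "项羽" / "顶羽" example from the background section. The source leaves the actual error detection to NLP techniques.

```python
def rewrite_keyword(keyword, lexicon, near_words):
    """Replace one character with a near-word if that yields a known term.

    keyword:    the search keyword as typed
    lexicon:    set of terms regarded as correct (hypothetical detector)
    near_words: {character: set of near-words} from the font database
    """
    if keyword in lexicon:
        return keyword  # no error detected
    for i, ch in enumerate(keyword):
        # sorted() only makes the choice deterministic for the demo
        for replacement in sorted(near_words.get(ch, ())):
            candidate = keyword[:i] + replacement + keyword[i + 1:]
            if candidate in lexicon:
                return candidate
    return keyword  # nothing better found, search with the original


lexicon = {"项羽"}
near_words = {"顶": {"项"}, "项": {"顶"}}
rewritten = rewrite_keyword("顶羽", lexicon, near_words)
```

The search of step 203 would then run on the value returned by `rewrite_keyword` instead of the raw input.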

In a preferred embodiment of the invention, the near-word can be obtained in the following manner:

Sub-step S21, determining a first character and a second character, input into the search engine, that are to be verified as near-words;

Sub-step S22, acquiring a first encoded character string of the first character and a second encoded character string of the second character according to a preset rule;

In a preferred example of the embodiment of the present invention, the preset rule may include a preset encoding rule, and the sub-step S22 may further include the following sub-steps:

Sub-step S221, calculating a first encoded character string corresponding to the first character according to a preset encoding rule;

Sub-step S222, calculating a second encoded character string corresponding to the second character according to the encoding rule;

The preset encoding rule may include a five-stroke encoding rule.

Sub-step S23, calculating an encoding distance between the first encoded character string and the second encoded character string;

Sub-step S24, when the encoding distance is less than the preset distance threshold, determining that the first character and the second character are near-words of each other;

Sub-step S25, establishing, in the search engine, a near-word mapping relationship between the first character and the second character.

It should be noted that, in the embodiment of the present invention, since sub-steps S21 to S25 are substantially similar to method embodiment 1, they are described only briefly here; for the relevant details, refer to the description of method embodiment 1.

In another preferred embodiment of the invention, the near-word can be obtained in the following manner:

Sub-step S31, determining the first character and the second character, input into the search engine, that are to be verified as near-words;

Sub-step S32, acquiring a first encoded character string of the first character and a second encoded character string of the second character according to a preset rule;

Sub-step S33, calculating an encoding distance between the first encoded character string and the second encoded character string;

Sub-step S34, respectively searching for a first input button corresponding to the first encoded character string;

Sub-step S35, respectively searching for a second input button corresponding to the second encoded character string;

Sub-step S36, respectively calculating a button distance between the first input button and the second input button;

Sub-step S37, configuring a weight corresponding to the coding distance according to the button distance;

Sub-step S38, when the coding distance configured with the weight is less than the preset distance threshold, determining that the first character and the second character are near-words of each other;

Sub-step S39, establishing a near-word mapping relationship between the first text and the second text in the search engine.

In the embodiment of the present invention, the key distance between the first input button and the second input button may be the physical distance of the input button on the keyboard.

In standard QWERTY fingering, the left index finger controls the keys R, T, F, G, V and B; the left middle finger controls E, D and C; the left ring finger controls W, S and X; the left little finger controls Q, A and Z; the right index finger controls Y, U, H, J, N and M; the right middle finger controls I and K; the right ring finger controls O and L; and the right little finger controls P. The F and J keys generally have raised bumps and serve as positioning keys.

Because of the positioning keys, a finger that strikes a key it does not control, for example the left index finger striking E, must stretch noticeably, which the user generally finds uncomfortable, so the probability of such a mistaken keystroke is small. Conversely, the probability of mistakenly striking another key controlled by the same finger is relatively large; for example, when the left index finger strikes R, it is easy to strike T by accident.

Therefore, the button distance can be inversely proportional to the weight. Optionally, for input buttons controlled by the same finger, a weight coefficient can be configured that reduces the weight, so that the coding distance between the first text and the second text becomes smaller, that is, their similarity becomes higher, reflecting the relatively large probability of a mistaken keystroke.
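One way to picture the same-finger case is a weighted edit distance in which substituting one key for another costs less when both keys are controlled by the same finger. The text does not spell out how the weight enters the distance, so everything below is an assumption made for illustration: the finger grouping stands in for physical key distance, and `SAME_FINGER_WEIGHT` and the function names are hypothetical.

```python
from typing import Dict

# Finger assignments taken from the fingering described above.
FINGER_KEYS: Dict[str, str] = {
    "left_index": "rtfgvb",
    "left_middle": "edc",
    "left_ring": "wsx",
    "left_little": "qaz",
    "right_index": "yuhjnm",
    "right_middle": "ik",
    "right_ring": "ol",
    "right_little": "p",
}
KEY_TO_FINGER: Dict[str, str] = {
    key: finger for finger, keys in FINGER_KEYS.items() for key in keys
}

SAME_FINGER_WEIGHT = 0.5  # hypothetical coefficient, chosen for illustration


def substitution_cost(a: str, b: str) -> float:
    """Cost of typing key b where key a was intended."""
    if a == b:
        return 0.0
    finger_a = KEY_TO_FINGER.get(a)
    if finger_a is not None and finger_a == KEY_TO_FINGER.get(b):
        return SAME_FINGER_WEIGHT  # likely slip: same finger, lower cost
    return 1.0


def weighted_edit_distance(a: str, b: str) -> float:
    """Edit distance whose substitution cost depends on the keys involved."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1.0,
                            curr[j - 1] + 1.0,
                            prev[j - 1] + substitution_cost(ca, cb)))
        prev = curr
    return prev[-1]
```

Under these assumptions, two codes that differ only by R versus T are at distance 0.5, while codes that differ by R versus E remain at distance 1.0, so the former pair is more likely to fall below the preset distance threshold.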

Step 203: Perform a search by using the rewritten search keyword to obtain search result data that matches the rewritten search keyword.

After the search keyword rewriting is completed, the network information can be retrieved and matched by using full-text indexing, directory indexing, and the like.

Step 204: Generate a search result page according to the search result data.

The search engine searches its database. If network information matching the content requested by the user is found, the relevance and ranking of each webpage are generally calculated from factors such as how well the keywords match the network information, where and how often they appear, and link quality. Links to the network information are then returned to the user in order of relevance.

Step 205: Prompt, in the search result page, information about the error correction applied to the search keyword.

In a specific implementation, the prompt may take any form. For example, the error-correction information for the search keyword may be shown below the input box of the search engine, or an enhanced prompt may mark the text before correction and the corrected text in different colors. The embodiment of the present invention is not limited in this respect.

Referring to FIG. 3, a flow chart of the steps of an embodiment of an instant search method according to an embodiment of the present invention is shown, which may include the following steps:

Step 301: Detect text information currently input in the search bar;

It should be noted that an instant search engine (ISE), also known as a real-time search engine, builds on emerging technologies such as RSS (Really Simple Syndication)/Atom (a pair of related standards) and tags (category labels). It focuses on frequently updated blog sites and news sites in the Chinese-language world and can provide users with near real-time results.

In a specific implementation, the instant search engine can detect the text information that the user enters in the search bar and return search results while the user is typing. As the user keeps entering new text information, the search result page, which the instant search engine can refresh at any time, changes accordingly.

Step 302: Perform error correction processing on the currently input text information.

In one case, the search keywords may be error-corrected using Natural Language Processing (NLP).

In another case, the currently input text information may be error-corrected by using a language model.

The instant search engine can collect the user's input text information in advance and then train the language model. The trained model can be an N-gram model (a language model commonly used in large-vocabulary continuous speech recognition), a neural-network-based language model, etc., and the user language model can be learned periodically or on the client side.
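As a rough illustration of the N-gram option, the sketch below trains a character bigram model on previously collected queries and flags positions whose bigram probability falls below a threshold. The add-alpha smoothing, the threshold value, and all names (`train_bigram_model`, `suspicious_positions`, `START`) are assumptions made for this example and are not specified in the text.

```python
from collections import Counter
from typing import Iterable, List, Tuple

START = "<s>"  # hypothetical start-of-query marker
Model = Tuple[Counter, Counter, int]


def train_bigram_model(queries: Iterable[str]) -> Model:
    """Count character bigrams and their left contexts over past queries."""
    contexts: Counter = Counter()
    bigrams: Counter = Counter()
    vocab = set()
    for query in queries:
        chars = [START] + list(query)
        vocab.update(query)
        contexts.update(chars[:-1])
        bigrams.update(zip(chars, chars[1:]))
    return contexts, bigrams, len(vocab)


def bigram_probability(prev: str, cur: str, model: Model,
                       alpha: float = 1.0) -> float:
    """Add-alpha smoothed estimate of P(cur | prev)."""
    contexts, bigrams, vocab_size = model
    return (bigrams[(prev, cur)] + alpha) / (contexts[prev] + alpha * vocab_size)


def suspicious_positions(text: str, model: Model, threshold: float) -> List[int]:
    """Indices in text whose character is unlikely given the preceding one."""
    chars = [START] + list(text)
    return [i for i in range(len(text))
            if bigram_probability(chars[i], chars[i + 1], model) < threshold]


model = train_bigram_model(["周末天气", "周末电影", "周末旅游"])
flagged = suspicious_positions("周未天气", model, threshold=0.1)
```

Here the character at index 1 of "周未天气" is flagged because "未" was never seen after "周" in the training queries, whereas the correctly typed "周末天气" produces no flagged positions.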

Of course, the above error correction processing methods are only examples. When implementing the embodiment of the present invention, those skilled in the art may adopt other error correction processing methods according to actual needs, which is not limited by the embodiment of the present invention.

Step 303, providing real-time search result data based on the currently input text information feedback;

In instant search, each time the user inputs new text information, a query request is automatically sent to the instant search engine and the search results are received and displayed, without the user having to trigger the query by pressing the Enter key.

Step 304: When an error is detected in the error correction processing on the text information, calculate an approximate word matching the character data contained in the erroneous text information;

In a particular implementation, the approximate word can include a shape-near word (a character of similar shape, referred to herein as a near-word) and/or a sound-near word (a character of similar pronunciation).

A sound-near word is a character whose pronunciation is the same or similar; for example, "案" and "安" are both pronounced "an". Chinese pinyin is composed of initials and finals, so the similarity between the initials and between the finals of the first and second characters can be calculated separately to obtain the similarity between their pronunciations. When the similarity is greater than the preset similarity threshold, it is determined that the first character and the second character are sound-near words.
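The initial/final comparison can be sketched as follows. The text does not define the similarity measure, so this example assumes a simple scheme: each part scores 1.0 when identical, 0.5 when it belongs to a hypothetical list of commonly confused pairs, and 0.0 otherwise, and the two scores are averaged. The pair lists, scores, and function names are illustrative assumptions, and pinyin is passed in as toneless strings.

```python
from typing import FrozenSet, Set, Tuple

# Two-letter initials come first so that "zh" is matched before "z".
# "y" and "w" are treated as initials here purely for simplicity.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g",
            "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

# Hypothetical lists of easily confused initials and finals.
SIMILAR_INITIALS: Set[FrozenSet[str]] = {
    frozenset(("z", "zh")), frozenset(("c", "ch")),
    frozenset(("s", "sh")), frozenset(("n", "l")),
}
SIMILAR_FINALS: Set[FrozenSet[str]] = {
    frozenset(("an", "ang")), frozenset(("en", "eng")), frozenset(("in", "ing")),
}


def split_pinyin(syllable: str) -> Tuple[str, str]:
    """Split a toneless pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # zero-initial syllable such as "an"


def part_similarity(a: str, b: str, similar: Set[FrozenSet[str]]) -> float:
    if a == b:
        return 1.0
    return 0.5 if frozenset((a, b)) in similar else 0.0


def pinyin_similarity(first: str, second: str) -> float:
    """Average of the initial similarity and the final similarity."""
    i1, f1 = split_pinyin(first)
    i2, f2 = split_pinyin(second)
    return (part_similarity(i1, i2, SIMILAR_INITIALS)
            + part_similarity(f1, f2, SIMILAR_FINALS)) / 2


def are_sound_near(first: str, second: str, threshold: float) -> bool:
    """True when the pronunciation similarity exceeds the threshold."""
    return pinyin_similarity(first, second) > threshold
```

Under these assumptions, "an" and "an" score 1.0, "shan" and "san" score 0.75, and "zan" and "zhang" score 0.5, so a threshold of 0.7 would accept the first two pairs and reject the third.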

When an error is detected in the error correction processing of the text information, the font database is searched for the near-word that most closely matches the text at the error.

In a preferred embodiment of the invention, the near-word can be obtained in the following manner:

Sub-step S41, determining the first text and the second text, input into the search engine, that are to be verified;

In a specific implementation, the first text and the second text may be extracted from a pre-collected corpus in order to verify whether they are near-words of each other.

Sub-step S42, acquiring a first encoded character string of the first character and a second encoded character string of the second character according to a preset rule;

In a preferred example of the embodiment of the present invention, the preset rule may include a preset encoding rule, and the sub-step S42 may further include the following sub-steps:

Sub-step S421, calculating a first encoded character string corresponding to the first character according to a preset encoding rule;

Sub-step S422, calculating a second encoded character string corresponding to the second character according to the encoding rule;

The preset encoding rule may include a five-stroke encoding rule.

Sub-step S43, calculating an encoding distance between the first encoded character string and the second encoded character string;

In a preferred example of an embodiment of the present invention, the encoding distance may include an edit distance.

Sub-step S44, when the encoding distance is less than the preset distance threshold, determining that the first character and the second character are near-words of each other.

Sub-step S45, establishing a near-word mapping relationship between the first text and the second text in the search engine.

In another preferred embodiment of the invention, the near-word can be obtained in the following manner:

Sub-step S51, determining the first character and the second character, input into the search engine, that are to be verified as near-words;

Sub-step S52, acquiring a first encoded character string of the first character and a second encoded character string of the second character according to a preset rule;

Sub-step S53, calculating an encoding distance between the first encoded character string and the second encoded character string;

Sub-step S54, respectively searching for the first input button corresponding to the first encoded character string;

Sub-step S55, respectively searching for a second input button corresponding to the second encoded character string;

Sub-step S56, respectively calculating a button distance between the first input button and the second input button;

Sub-step S57, configuring a weight corresponding to the coding distance according to the button distance;

Sub-step S58, when the coding distance configured with the weight is less than the preset distance threshold, determining that the first character and the second character are near-words of each other;

Sub-step S59, establishing a near-word mapping relationship between the first text and the second text in the search engine.

Step 305: insert, in the instant search result data, prompt information of a recommended approximate text for correcting the text information of the found error;

In a specific implementation, the prompt may take any form. For example, information recommending the approximate word as a correction may be shown below the input box, or an enhanced prompt may mark the text before correction and the recommended text in different colors. The embodiment of the present invention is not limited in this respect.

Step 306: When receiving a trigger indication of the prompt information by the user, provide real-time search result data that is searched by the approximate text corresponding to the trigger indication.

The trigger indication may refer to an instruction sent by the user to replace the erroneous text information with an approximate word. For example, when the user clicks on the prompt information, this is equivalent to receiving the trigger indication. As another example, when the user selects an approximate word with a key such as the Tab key and then presses the Enter key, this is also equivalent to receiving the trigger indication.

When the user's trigger indication for the prompt information is received, the instant search result data fed back for the text information, as corrected according to the trigger indication, may be provided again.
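Steps 301 to 306 can be pictured with the following sketch. It is only an illustration under simplifying assumptions: the "index" is a list of strings searched by substring match, error detection is a lookup in a hypothetical vocabulary, and the names `InstantResponse`, `on_input`, and `on_trigger` are invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class InstantResponse:
    results: List[str]         # instant search result data (step 303)
    suggestion: Optional[str]  # recommended approximate text (step 305)


def search(index: List[str], text: str) -> List[str]:
    """Toy search: documents containing the text as a substring."""
    return [doc for doc in index if text in doc]


def suggest(text: str, near_words: Dict[str, Set[str]],
            vocabulary: Set[str]) -> Optional[str]:
    """Steps 302 and 304: detect an error and compute an approximate text."""
    if text in vocabulary:
        return None
    for i, ch in enumerate(text):
        for alt in sorted(near_words.get(ch, ())):
            candidate = text[:i] + alt + text[i + 1:]
            if candidate in vocabulary:
                return candidate
    return None


def on_input(text: str, index: List[str], near_words: Dict[str, Set[str]],
             vocabulary: Set[str]) -> InstantResponse:
    """Called on every change of the search bar content (step 301)."""
    return InstantResponse(search(index, text),
                           suggest(text, near_words, vocabulary))


def on_trigger(response: InstantResponse, index: List[str]) -> List[str]:
    """Step 306: the user accepted the prompt, so search with the suggestion."""
    if response.suggestion is None:
        return response.results
    return search(index, response.suggestion)


index = ["周末天气预报", "周末电影推荐"]
response = on_input("周未", index, {"未": {"末"}}, {"周末"})
accepted = on_trigger(response, index)
```

In this usage, typing "周未" yields no results but a suggestion of "周末"; once the user triggers the prompt, both indexed entries are returned.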

For the sake of brevity, the method embodiments are all described as a series of combined actions, but those skilled in the art will appreciate that the present invention is not limited by the described order of actions, because according to the present invention some steps can be performed in other orders or at the same time. Furthermore, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.

Referring to FIG. 4, a block diagram of an embodiment of an apparatus for determining a near-word in a search engine according to an embodiment of the present invention is shown, which may include the following modules:

The text determining module 401 is adapted to determine the first text and the second text to be verified in the input search engine;

The encoding obtaining module 402 is configured to acquire the first encoded character string of the first character and the second encoded character string of the second character according to a preset rule;

The encoding distance calculation module 403 is adapted to calculate an encoding distance between the first encoded character string and the second encoded character string;

The near-word determining module 404 is configured to determine that the first character and the second character are near-words of each other when the encoding distance is less than a preset distance threshold;

The mapping relationship determining module 405 is adapted to establish a near-word mapping relationship between the first text and the second text in the search engine.

In a preferred embodiment of the present invention, the preset rule may include a preset encoding rule, and the encoding obtaining module may further be configured to:

Calculating, according to a preset encoding rule, a first encoded character string corresponding to the first character;

Calculating, according to the encoding rule, a second encoded character string corresponding to the second character;

The preset encoding rule includes a five-stroke encoding rule.

In a preferred embodiment of the present invention, the following modules may also be included:

An output module, configured to output the first character and the second character that are near-words of each other, together with their near-word mapping relationship, to a specified font database.

Referring to FIG. 5, a structural block diagram of an apparatus for providing error correction of keywords in a search according to an embodiment of the present invention is shown, which may include the following units:

The receiving unit 501 is adapted to receive a search request; the search request includes a search keyword;

The rewriting unit 502 is adapted to rewrite the search keyword by using a near-word matching the search keyword when an error is detected in the error correction processing on the search keyword;

The searching unit 503 is adapted to perform a search by using the rewritten search keyword to obtain search result data that matches the rewritten search keyword.

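In a preferred embodiment of the invention, the near-word can be obtained by calling the text determining module, the encoding obtaining module, the encoding distance calculation module, the near-word determining module, and the mapping relationship determining module described above.

In a preferred embodiment of the present invention, the preset rule may include a preset encoding rule, and the encoding obtaining module is further adapted to calculate the first encoded character string and the second encoded character string according to the preset encoding rule.

The preset encoding rule includes a five-stroke encoding rule.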

In a preferred embodiment of the invention, the near-word can also be obtained by calling the following modules:

a first search module, configured to separately search for a first input button corresponding to the first encoded character string;

a second search module, configured to separately search for a second input button corresponding to the second encoded character string;

a button distance calculation module, configured to separately calculate a button distance between the first input button and the second input button;

a weight configuration module, configured to configure a weight corresponding to the coding distance according to the button distance;

The near-word determining module may also be adapted to:

When the coding distance configured with the weight is less than the preset distance threshold, determine that the first character and the second character are near-words of each other.

In a preferred embodiment of the invention, the button distance may be inversely proportional to the weight.

A generating unit is adapted to generate a search result page based on the search result data.

The prompting unit is adapted to prompt information for correcting the search keyword in the search result page.

Referring to FIG. 6, a structural block diagram of an embodiment of an instant search system according to an embodiment of the present invention is shown, which may include the following modules:

The text information detecting unit 601 is adapted to detect text information currently input in the search bar;

The error correction processing unit 602 is adapted to perform error correction processing on the currently input text information;

The first result providing unit 603 is adapted to provide real-time search result data based on the currently input text information feedback;

The approximate word calculation unit 604 is adapted to calculate, when an error is found in the error correction processing on the text information, an approximate word that matches the character data contained in the erroneous text information;

The error correction prompting unit 605 is adapted to insert, in the instant search result data, prompt information of the recommended approximate text for correcting the text information of the found error;

The second result providing unit 606 is adapted to provide the real-time search result data that is searched by the approximate text corresponding to the trigger indication when receiving the trigger indication of the prompt information by the user.

In a preferred embodiment of the invention, the approximate word may comprise a shape-near word (near-word) and/or a sound-near word.

The preset encoding rule includes a five-stroke encoding rule.

For the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and the relevant parts can be referred to the description of the method embodiment.

The various component embodiments of the present invention may be implemented in hardware, or in a software module running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components of the apparatus for determining a near-word in a search engine, the apparatus for providing keyword error correction in a search, and/or the instant search system according to embodiments of the present invention. The invention can also be implemented as a device or device program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein. Such a program implementing the invention may be stored on a computer readable medium or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

For example, FIG. 7 illustrates a computing device, such as an application server, that can implement the determination of a near-word in a search engine, keyword error correction in a search, and instant search in accordance with the present invention. The computing device conventionally includes a processor 710 and a computer program product or computer readable medium in the form of a memory 720. Memory 720 can be an electronic memory such as a flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), EPROM, hard disk, or ROM. Memory 720 has a storage space 730 for program code 731 for performing any of the method steps described above. For example, the storage space 730 for program code may include individual program codes 731 for implementing the various steps of the above methods. The program code can be read from or written to one or more computer program products. These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. The storage unit may have storage segments, storage spaces, and the like that are arranged similarly to memory 720 in the computing device of FIG. The program code can, for example, be compressed in an appropriate form. Typically, the storage unit includes computer readable code 731', i.e., code readable by a processor such as 710, that when executed by a computing device causes the computing device to perform each step of the methods described above.

References herein to "one embodiment", "an embodiment", or "one or more embodiments" mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. In addition, it is noted that instances of the phrase "in one embodiment" do not necessarily all refer to the same embodiment.

In the description provided herein, numerous specific details are set forth. However, it is understood that the embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of the description.

It is to be noted that the above-described embodiments illustrate rather than limit the invention, and that those skilled in the art may devise alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps that are not recited in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order; these words can be interpreted as names.

In addition, it should be noted that the language used in the specification has been selected principally for readability and teaching purposes, and not to delineate or restrict the subject matter of the invention. Therefore, many modifications and changes will be apparent to those skilled in the art without departing from the scope of the invention. The disclosure of the present invention is intended to be illustrative, and not restrictive, and the scope of the invention is defined by the appended claims.

Claims

A method for determining a near-word in a search engine, comprising:

Determining the first text and the second text to be verified in the input search engine;

Acquiring the first encoded character string of the first character and the second encoded character string of the second character according to a preset rule;

Calculating an encoding distance between the first encoded string and the second encoded string;

When the encoding distance is less than a preset distance threshold, determining that the first character and the second character are near-words of each other;

A near-word mapping relationship between the first text and the second text is established in the search engine.
The method according to claim 1, wherein the preset rule comprises a preset encoding rule, and the step of acquiring a first encoded character string of the first text and a second encoded character string of the second text according to a preset rule comprises:

Calculating, according to a preset encoding rule, a first encoded character string corresponding to the first character;

Calculating, according to the encoding rule, a second encoded character string corresponding to the second character;

The preset encoding rule includes a five-stroke encoding rule.
The method of claim 1 or 2, further comprising:

Outputting the first character and the second character that are near-words of each other, together with the near-word mapping relationship, to a specified font database.
A method for providing keyword correction in a search, comprising:

Receiving a search request; the search request includes a search keyword;

When an error is found in the error correction processing on the search keyword, the search keyword is rewritten by using a near word matching the search keyword;

The search is performed by the rewritten search keyword, and search result data matching the rewritten search keyword is obtained.
The method of claim 4 wherein said near word is obtained by:

Determining the first character and the second character, input into the search engine, that are to be verified as near-words;

Acquiring the first encoded character string of the first character and the second encoded character string of the second character according to a preset rule;

Calculating an encoding distance between the first encoded string and the second encoded string;

When the encoding distance is less than a preset distance threshold, determining that the first character and the second character are near-words of each other;

A near-word mapping relationship between the first text and the second text is established in the search engine.
The method according to claim 5, wherein the preset rule includes a preset encoding rule, and the step of acquiring the first encoded character string of the first character and the second encoded character string of the second character according to a preset rule includes:

Calculating, according to a preset encoding rule, a first encoded character string corresponding to the first character;

Calculating, according to the encoding rule, a second encoded character string corresponding to the second character;

The preset encoding rule includes a five-stroke encoding rule.
The method according to claim 5 or 6, wherein the near-word corresponding to the first character in the font database is also obtained by:

Searching for a first input button corresponding to the first encoded string;

Searching for a second input button corresponding to the second encoded character string;

Calculating a button distance between the first input button and the second input button, respectively;

Setting a weight corresponding to the coding distance according to the button distance;

The step of determining, when the coding distance is less than the preset distance threshold, that the first text and the second text are near-words of each other is:

When the coding distance configured with the weight is less than the preset distance threshold, determining that the first character and the second character are near-words of each other.
The method of claim 7 wherein said button distance is inversely proportional to said weight.
The method of claim 4, further comprising:

A search result page is generated based on the search result data.
An instant search method, including:

Detecting the currently input text information in the search bar, performing error correction processing on the currently input text information, and providing real-time search result data based on the currently input text information feedback;

When an error is found in the error correction processing of the text information, an approximate character matching the character data included in the text information found to be erroneous is calculated;

Inserting, in the instant search result data, prompt information of the recommended approximate text for correcting the erroneous text information;

When the trigger indication of the prompt information by the user is received, the instant search result data that is searched by the approximate text corresponding to the trigger indication is provided.
The method of claim 10 wherein said approximate word comprises a shape-near word (near-word) and/or a sound-near word.
The method of claim 11 wherein said near word is obtained by:

Determining the first text and the second text, input into the search engine, that are to be verified as near-words;

Acquiring the first encoded character string of the first character and the second encoded character string of the second character according to a preset rule;

Calculating an encoding distance between the first encoded string and the second encoded string;

When the encoding distance is less than a preset distance threshold, determining that the first character and the second character are near-words of each other;

A near-word mapping relationship between the first text and the second text is established in the search engine.
A device for determining a near-word in a search engine, comprising:

a text determining module, configured to determine a first text and a second text to be verified in the input search engine;

a code obtaining module, configured to acquire a first encoded character string of the first character and a second encoded character string of the second character according to a preset rule;

a coding distance calculation module, configured to calculate an encoding distance between the first encoded character string and the second encoded character string;

The near-word determining module is configured to determine that the first character and the second character are near-words of each other when the encoding distance is less than a preset distance threshold;

The mapping relationship determining module is adapted to establish a near-word mapping relationship between the first text and the second text in the search engine.
The device according to claim 13, wherein the preset rule comprises a preset encoding rule, and the encoding obtaining module is further adapted to:

Calculating, according to a preset encoding rule, a first encoded character string corresponding to the first character;

Calculating, according to the encoding rule, a second encoded character string corresponding to the second character;

The preset encoding rule includes a five-stroke encoding rule.
The device according to claim 13 or 14, further comprising:

An output module, configured to output the first character and the second character that are near-words of each other, together with the near-word mapping relationship, to a specified font database.
A device for providing keyword error correction in a search, comprising:

a receiving unit, configured to receive a search request; the search request includes a search keyword;

The rewriting unit is adapted to rewrite the search keyword by using a near-word matching the search keyword when an error is detected in the error correction processing of the search keyword;

The search unit is adapted to perform a search by using the rewritten search keyword to obtain search result data that matches the rewritten search keyword.
The apparatus of claim 16 wherein said near-word is obtained by invoking the following module:

a text determining module, configured to determine a first text and a second text to be verified in the input search engine;

a code obtaining module, configured to acquire a first encoded character string of the first character and a second encoded character string of the second character according to a preset rule;

a coding distance calculation module, configured to calculate an encoding distance between the first encoded character string and the second encoded character string;

The near-word determining module is configured to determine that the first character and the second character are near-words of each other when the encoding distance is less than a preset distance threshold;

The mapping relationship determining module is adapted to establish a near-word mapping relationship between the first text and the second text in the search engine.
An instant search system comprising:

a text information detecting unit, configured to detect text information currently input in the search bar;

The error correction processing unit is adapted to perform error correction processing on the currently input text information;

a first result providing unit, configured to provide real-time search result data based on the currently input text information feedback;

The approximate word calculation unit is adapted to calculate, when an error is found in the error correction processing on the text information, an approximate word that matches the character data contained in the erroneous text information;

The error correction prompting unit is adapted to insert, in the instant search result data, prompt information of the recommended approximate text for correcting the text information of the found error;

The second result providing unit is adapted to provide the real-time search result data that is searched by the approximate text corresponding to the trigger indication when receiving the trigger indication of the prompt information by the user.
A computer program comprising computer readable code which, when run on a computing device, causes said computing device to perform the method for determining a near-word in a search engine, the method for providing keyword error correction in a search, and/or the instant search method according to any of claims 1-12.
A computer readable medium storing the computer program of claim 19.