WO2015139497A1 - Method and apparatus for determining similar characters in search engine - Google Patents

Method and apparatus for determining similar characters in search engine

Info

Publication number
WO2015139497A1
WO2015139497A1 PCT/CN2014/094933 CN2014094933W WO2015139497A1 WO 2015139497 A1 WO2015139497 A1 WO 2015139497A1 CN 2014094933 W CN2014094933 W CN 2014094933W WO 2015139497 A1 WO2015139497 A1 WO 2015139497A1
Authority
WO
WIPO (PCT)
Prior art keywords
character
search
text
word
encoded
Prior art date
Application number
PCT/CN2014/094933
Other languages
French (fr)
Chinese (zh)
Inventor
项碧波
Original Assignee
北京奇虎科技有限公司
奇智软件(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201410104483.XA external-priority patent/CN103927330A/en
Priority claimed from CN201410103601.5A external-priority patent/CN103927329B/en
Application filed by 北京奇虎科技有限公司, 奇智软件(北京)有限公司 filed Critical 北京奇虎科技有限公司
Publication of WO2015139497A1 publication Critical patent/WO2015139497A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation

Definitions

  • The invention relates to the technical field of language text information, and in particular to a method for determining a near-word (a character of similar shape) in a search engine, a method for providing error correction of Chinese keywords in a search, an instant search method, an apparatus for determining a near-word in a search engine, an apparatus for providing error correction of Chinese keywords in a search, and an instant search system.
  • users often need to input language texts for information interaction. For example, enter a keyword search web page information in a search engine, enter words and phrases in an instant messaging tool to communicate with other users, and the like.
  • Language characters can have near-words, that is, other characters with a similar structure.
  • Various encoding methods are defined for inputting language characters, such as five-stroke (Wubi) encoding and pinyin encoding.
  • When the user inputs language text with such an encoding method, the existence of near-words makes it easy to mis-operate and input a different character.
  • users often need to re-enter language text, which is not only troublesome, but also wastes system resources.
  • the inaccuracy of the five-stroke input text depends on whether the user is careful or cognizant about the Chinese character itself.
  • mis-operational search results are very different from the original expectations, the user experience is very poor, wasting the resources of the client and the resources of the search engine.
  • the user needs to obtain the webpage information that he is interested in, and will input the keyword again in the search engine to search.
  • The search engine must then search, compare, and filter the massive information again to obtain the information related to the search keyword.
  • User operations are more cumbersome and time consuming, and will greatly increase the burden on search engines and consume more resources of clients and search engines.
  • In view of the above problems, the present invention has been made in order to provide a method for determining a near-word in a search engine, a method for providing Chinese keyword error correction in a search, and an instant search method, together with the corresponding apparatus and system, that overcome the above problems or at least partially solve them.
  • a method for determining a near-word in a search engine including:
  • a near-word mapping relationship between the first text and the second text is established in the search engine.
  • a method for providing keyword error correction in a search comprising:
  • the search request includes a search keyword
  • the search keyword is rewritten by using a near word matching the search keyword
  • the search is performed by the rewritten search keyword, and search result data matching the rewritten search keyword is obtained.
  • an instant search method comprising:
  • Detecting the text information currently input in the search bar, performing error correction processing on the currently input text information, and providing real-time search result data fed back based on the currently input text information;
  • the instant search result data that is searched by the approximate text corresponding to the trigger indication is provided.
  • an apparatus for determining a near-word in a search engine comprising:
  • a text determining module configured to determine a first text and a second text to be verified in the input search engine
  • a code obtaining module configured to acquire a first encoded character string of the first character and a second encoded character string of the second character according to a preset rule
  • a coding distance calculation module configured to calculate an encoding distance between the first encoded character string and the second encoded character string
  • the near-word determining module is configured to determine that the first character and the second character are in close proximity to each other when the encoding distance is less than a preset distance threshold;
  • the mapping relationship determining module is adapted to establish a near-word mapping relationship between the first text and the second text in the search engine.
  • an apparatus for providing keyword error correction in a search including:
  • a receiving unit configured to receive a search request; the search request includes a search keyword;
  • the rewriting unit is adapted to rewrite the search keyword by using a shape near word matching the search keyword when an error is detected in the error correction processing of the search keyword;
  • the search unit is adapted to perform a search by using the rewritten search keyword to obtain search result data that matches the rewritten search keyword.
  • an instant search system comprising:
  • a text information detecting unit configured to detect text information currently input in the search bar
  • the error correction processing unit is adapted to perform error correction processing on the currently input text information
  • a first result providing unit configured to provide real-time search result data based on the currently input text information feedback
  • the approximate word calculation unit is adapted to perform error correction processing on the text information, and when calculating an error, calculate an approximate character that matches the character data included in the text information found to be erroneous;
  • the error correction prompting unit is adapted to insert, in the instant search result data, prompt information of the recommended approximate text for correcting the text information of the found error;
  • the second result providing unit is adapted to provide the real-time search result data that is searched by the approximate text corresponding to the trigger indication when receiving the trigger indication of the prompt information by the user.
  • a computer program comprising computer readable code that, when executed on a computing device, causes the computing device to perform the method for determining a near-word in a search engine, the method for providing keyword error correction in a search, and the instant search method described above.
  • a computer readable medium wherein the computer program described above is stored.
  • the search keyword is subjected to error correction processing, and the search keyword is rewritten by using a near-word matching the search keyword to obtain search result data that matches the rewritten search keyword.
  • The rewritten search keyword brings the search results closer to what the user originally expected, which improves the user experience, reduces the waste of client and search engine resources, and improves search efficiency.
  • It also spares users from having to input keywords in the search engine again to obtain the webpage information they are interested in, so the search engine does not have to search, compare, and filter the massive information again to obtain information related to the search keywords.
  • the user operation is more convenient, the user's time consumption is reduced, and the resource consumption of the client and the search engine is further reduced.
  • error correction processing is performed on the text information in the real-time search engine, and the search keyword is rewritten by using approximate text matching the text information to obtain search result data that matches the rewritten text information.
  • The rewritten search keyword brings the search results closer to what the user originally expected, which improves the user experience, reduces the waste of client and search engine resources, and improves search efficiency.
  • It also spares users from having to input keywords in the search engine again to obtain the webpage information they are interested in, so the search engine does not have to search, compare, and filter the massive information again to obtain information related to the search keywords.
  • the user operation is more convenient, the user's time consumption is reduced, and the resource consumption of the client and the search engine is further reduced.
  • FIG. 1 is a flow chart showing the steps of an embodiment of a method for determining a near-word in a search engine, in accordance with one embodiment of the present invention
  • FIG. 2 is a flow chart showing the steps of an embodiment of a method for providing keyword error correction in a search according to an embodiment of the present invention
  • FIG. 3 is a flow chart showing the steps of an embodiment of an instant search method in accordance with an embodiment of the present invention.
  • FIG. 4 is a block diagram schematically showing an embodiment of an apparatus for determining a near-word in a search engine according to an embodiment of the present invention
  • FIG. 5 is a block diagram schematically showing an embodiment of an apparatus for providing error correction of keywords in a search according to an embodiment of the present invention
  • FIG. 6 is a block diagram showing the structure of an embodiment of an instant search system according to an embodiment of the present invention.
  • Figure 7 shows schematically a block diagram of a computing device for performing the method according to the invention
  • Fig. 8 schematically shows a storage unit for holding or carrying program code implementing the method according to the invention.
  • Referring to FIG. 1, there is shown a flow chart of the steps of an embodiment of a method for determining a near-word in a search engine in accordance with one embodiment of the present invention, which may include the following steps:
  • Step 101 Determine a first text and a second text to be verified in the input search engine
  • the processing flow of the search engine can generally be divided into two parts, the first part is the front-end user request, and the second part is the back-end production data.
  • the front-end user request processing process can include:
  • the user enters a keyword
  • Sorting sorting candidate webpages according to dimensions such as content relevance and timeliness;
  • the backend production data process can include:
  • Index production analyze the crawled and saved webpages, segment the page title and page text, and make an inverted index based on the word segmentation results for front-end retrieval.
  • the webpage crawled by the crawler can be saved in the webpage database, and the webpage stores a lot of text information, and the webpage database can also be called a corpus.
  • the first text and the second text may be extracted from the corpus to perform verification of whether the characters are close to each other.
  • the first character and the second character may be Chinese characters.
  • Step 102 Acquire a first encoded character string of the first character and a second encoded character string of the second character according to a preset rule.
  • the text can have a specific character structure characteristic, and is encoded according to the character structure characteristic, and an input mode is established, so that inputting characters in the electronic device can be realized.
  • the first character and the second character can perform a pinyin input mode, a five-stroke input mode, a stroke input mode, and the like.
  • first text and the second text may correspond to different first encoded strings and second encoded strings for different encoding rules.
  • For example, for the character "侧" ("side"), the code string corresponding to the Pinyin input mode is "ce", and the code string corresponding to the Wubi input mode is "WMJh".
  • the preset rule may include a preset encoding rule
  • step 102 may include the following sub-steps:
  • Sub-step S11 calculating a first encoded character string corresponding to the first character according to a preset encoding rule
  • Sub-step S12 calculating a second encoded character string corresponding to the second character according to the encoding rule
  • the preset encoding rule may include a five-stroke encoding rule.
  • Chinese characters are composed of strokes or radicals.
  • Chinese characters can be broken into some of the most commonly used basic units, namely the root.
  • the root can be the radical part of the Chinese character, or it can be part of the radical, or even a stroke.
  • the characters can be divided into four types according to the positional relationship between the roots: single, scattered, connected, and intersected.
  • the single can refer to the root itself as a Chinese character, including the key name root and the word root, such as mouth, wood, etc.
  • scattered can mean that the roots constituting the Chinese character can maintain a certain distance, such as Han, Xiang, etc.
  • intersection can refer to the intersection of several radicals to form a Chinese character, for example, " ⁇ " is made by " ⁇ " ".
  • Wubi is the abbreviation of Wubi input method, which is a shape code input method.
  • the root is the basic unit of the five-stroke input method.
  • the Chinese characters are encoded according to the strokes and glyph features, the roots are classified according to certain rules, and these roots are distributed on the keyboard as the basic unit for inputting Chinese characters.
  • Wubi divides the strokes of Chinese characters into five categories, one per keyboard zone: horizontal (including the rising stroke), vertical, left-falling, right-falling (including the dot), and turning.
  • the roots or symbols are distributed on 25 letter keys in a certain pattern (ie standard QWERTY keyboard, excluding Z key).
  • When inputting, the keys corresponding to the roots may be pressed in sequence according to the writing order and structure of the Chinese character to form an encoded character string, and the system retrieves the desired text from the font library of the five-stroke input method according to the input encoded character string.
  • In the Wubi input method, the use of identification codes keeps the duplicate-code rate (several characters sharing one encoded string) of single characters low, but the duplicate-code rate of phrases is high. Therefore, the Wubi input method generally does not use a large vocabulary, so as to avoid excessive duplicate codes. Conversely, the Wubi input method is especially suitable for single-character input, where it achieves higher input efficiency.
  • Step 103 Calculate an encoding distance between the first encoded string and the second encoded string.
  • the similarity between the first encoded string and the second encoded string can be identified.
  • the encoding distance may include an editing distance.
  • Edit Distance also known as Levenshtein distance, can refer to the minimum number of edit operations required to convert from one string to another, such as the first encoded string and the second encoded string.
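  • The edit distance described above can be sketched as follows. This is a minimal illustration of the standard Levenshtein dynamic program, not the implementation used by the described search engine; the function name `edit_distance` is an assumption introduced here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]
```

  • Only two rows of the table are kept at a time, which is sufficient because encoded strings such as Wubi codes are at most a few characters long.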
  • Step 104 When the coding distance is less than a preset distance threshold, determine that the first character and the second character are near-words of each other.
  • the near-word can be a text with a similar glyph structure, which is confusing when used. For example, “self”, “has”, and “ ⁇ " are close to each other.
  • the root or symbol is generally in the form of a block, which is the same as or similar to the stroke or the radical that constitutes the text, and is concentrated in one or adjacent keys.
  • the root of the H key corresponds to "mesh, top, bu, stop, tiger, head, and".
  • the splitting rules may include: writing order, taking precedence, taking into account the intuitiveness, being able to connect, and not being able to connect.
  • the strokes that make up the text or the capital of the radicals have certain rules of use, which can include position rules, writing rules, and so on. For example, “single” next to a single person and “ ⁇ " next to a double person are generally on the far left side of the text, and the highest priority is written, such as “you", “100 million”, “very”, “to”, and so on.
  • strokes or radicals allow Chinese characters to be divided into single characters (words consisting of strokes such as top, bottom, day, and month, or words consisting of a single radical) and fit words (such as hanging, rest, Take, Ming, etc. consist of words from the radicals).
  • the Chinese character structure can be divided into:
  • The coding distance is calculated for the first code string and the second code string corresponding to the first character and the second character.
  • If the value is smaller than the preset distance threshold, the similarity is high, and the two characters may be regarded as near-words.
  • If the coding distance is greater than or equal to the preset distance threshold, the similarity is low, and the two characters can be regarded as not being near-words.
  • the distance threshold can be preset to 2.
  • For example, the encoding string of "候" ("waiting") is "whnd" and the encoding string of "侯" ("hou") is "wntd"; if the encoding distance between "whnd" and "wntd" is 1, which is less than the distance threshold 2, it can be determined that "候" and "侯" are near-words of each other.
  • Step 105 Establish a near-word mapping relationship between the first text and the second text in the search engine.
  • the font database may be separately established in the search engine to collect the near-words of the current text and the corresponding near-word mapping relationship.
  • the near-word mapping relationship may be mutual.
  • Step 106 Output the first character and the second character and the near-word mapping relationship of the mutually similar words to a specified font database.
  • all the characters can be traversed in the corpus, the near-word of the current text can be searched, and the searched near-word and near-word mapping relationship can be generated to generate a font database of the current text.
  • For example, the font database of the first text stores one or more near-words and near-word mapping relationships, such as first text: second text, third text, and fourth text; and the font database of the second text stores one or more near-words and near-word mapping relationships, such as second text: first text, fifth text, and sixth text.
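  • Steps 103 to 106 can be sketched together as follows: every pair of characters in a code table is compared, and pairs whose coding distance is below the threshold are recorded in both directions. This is a sketch under stated assumptions: `build_near_word_map`, the placeholder characters "A" to "D", and their code strings are hypothetical, and a plain Levenshtein distance stands in for the coding distance.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance between two encoded strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def build_near_word_map(codes: dict, threshold: int = 2) -> dict:
    """Map each character to the set of its near-words.

    codes maps a character to its encoded string (for example a Wubi code).
    The mapping is mutual: if X is a near-word of Y, Y is a near-word of X.
    """
    near = {}
    for (c1, s1), (c2, s2) in combinations(codes.items(), 2):
        if edit_distance(s1, s2) < threshold:  # strictly below the threshold
            near.setdefault(c1, set()).add(c2)
            near.setdefault(c2, set()).add(c1)
    return near


# Hypothetical code table; the keys are placeholders, not real characters.
toy_codes = {"A": "whnd", "B": "whnt", "C": "wntd", "D": "kkkk"}
near_map = build_near_word_map(toy_codes)
```

  • With this toy table only "A" and "B" are mapped to each other. Note that a plain Levenshtein distance between "whnd" and "wntd" is 2 (two substitutions), so under the strict "less than 2" test sketched here that pair would not be mapped; the source does not specify the exact distance definition that yields 1 for that pair.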
  • Referring to FIG. 2, a flow chart of the steps of an embodiment of a method for providing keyword error correction in a search according to an embodiment of the present invention is shown, which may include the following steps:
  • Step 201 Receive a search request, where the search request includes a search keyword
  • the search request may refer to an instruction issued by the user to search using a certain search keyword.
  • a user may issue a search request through a search engine web page, or in a search plugin to issue a search request, and the like.
  • For example, when the user enters a search keyword in the search box of the search engine and confirms by clicking or by pressing the Enter key, this is equivalent to a search request being received; likewise, when a search keyword is entered in the input box of the search plugin and the user confirms by clicking or by pressing the Enter key, this is equivalent to the search request being received.
  • Step 202 When an error is found in the error correction processing on the search keyword, the search keyword is rewritten by using a near word matching the search keyword;
  • the search keyword may be error-corrected using Natural Language Processing (NLP).
  • Error correction processing can generally be split into two subtasks:
  • Non-word Errors are cases where the misspelled result is not a legal word, such as "giraffe" being miswritten as "graffe";
  • Real-word Errors are cases where the misspelled result is still a legal word, such as "there" being misspelled as "three" (similar spelling), "peace" being misspelled as "piece" (same pronunciation), and "two" being misspelled as "too" (same pronunciation).
  • Spelling correction may be performed based on a Noisy Channel Model or the like;
  • Spelling error correction of search keywords can check for errors between adjacent characters and between adjacent words; the search keyword is then rewritten by looking up, through the near-word mapping relationship in the font database, the near-word that most closely matches the text at the error.
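  • The rewriting step can be illustrated with a short sketch. It assumes, purely for illustration, that an error is "found" when the keyword is absent from a vocabulary of known phrases, and that the near-word mapping is available as a dictionary; `rewrite_keyword`, `toy_near_map`, and `toy_vocabulary` are hypothetical names and toy data, not part of the described system.

```python
def rewrite_keyword(keyword: str, near_map: dict, vocabulary: set) -> str:
    """Return the keyword rewritten with a near-word, if that helps.

    The keyword is kept when it is already a known phrase. Otherwise each
    character is replaced in turn by each of its near-words, and the first
    candidate found in the vocabulary is returned.
    """
    if keyword in vocabulary:
        return keyword
    for i, ch in enumerate(keyword):
        for candidate in sorted(near_map.get(ch, ())):
            rewritten = keyword[:i] + candidate + keyword[i + 1:]
            if rewritten in vocabulary:
                return rewritten
    return keyword  # no correction found; search with the original keyword


# Toy data (assumed): a mutual near-word mapping and a tiny vocabulary.
toy_near_map = {"候": {"侯"}, "侯": {"候"}}
toy_vocabulary = {"诸侯", "等候"}
corrected = rewrite_keyword("诸候", toy_near_map, toy_vocabulary)
```

  • A production system would more likely rank candidates with a language model or a noisy channel model rather than accept the first vocabulary hit; the lookup above only shows where the near-word mapping enters the process.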
  • the near-word can be obtained in the following manner:
  • Sub-step S21 determining the first text and the second text to be verified in the input search engine
  • Sub-step S22 acquiring a first encoded character string of the first character and a second encoded character string of the second character according to a preset rule
  • the preset rule may include a preset encoding rule
  • the sub-step S22 may further include the following sub-steps:
  • Sub-step S222 calculating a second encoded character string corresponding to the second character according to the encoding rule
  • the preset encoding rule may include a five-stroke encoding rule.
  • Sub-step S23 calculating an encoding distance between the first encoded character string and the second encoded character string
  • Sub-step S24 when the encoding distance is less than the preset distance threshold, determining that the first character and the second character are in close proximity to each other;
  • Sub-step S25 establishing a near-word mapping relationship between the first text and the second text in the search engine.
  • the near-word can be obtained in the following manner:
  • Sub-step S31 determining the first text and the second text to be verified in the input search engine
  • Sub-step S32 acquiring a first encoded character string of the first character and a second encoded character string of the second character according to a preset rule
  • Sub-step S33 calculating an encoding distance between the first encoded character string and the second encoded character string
  • Sub-step S34 respectively searching for a first input button corresponding to the first encoded character string
  • Sub-step S35 respectively searching for a second input button corresponding to the second encoded character string
  • Sub-step S36 respectively calculating a button distance between the first input button and the second input button
  • Sub-step S37 configuring a weight corresponding to the coding distance according to the button distance
  • Sub-step S38 when the coded distance configured with the weight is less than the preset distance threshold, determining that the first character and the second character are in close proximity to each other;
  • Sub-step S39 establishing a near-word mapping relationship between the first text and the second text in the search engine.
  • the key distance between the first input button and the second input button may be the physical distance of the input button on the keyboard.
  • the buttons F and J generally have protrusions as positioning keys.
  • Because of the positioning keys, when a finger clicks a button that it does not normally control, for example the left index finger clicking the button E, the finger span is large and the user generally feels obvious discomfort, so the probability of such a wrong click is small. Conversely, the probability of accidentally clicking another button controlled by the same finger is relatively large. For example, when the left index finger clicks the button R, it is easy to accidentally click T.
  • the button distance can be inversely proportional to the weight.
  • For the button distance between input buttons controlled by the same finger, a weight coefficient can be configured to reduce the weight, so that the coding distance between the first text and the second text becomes smaller, that is, the similarity becomes higher, reflecting the relatively large probability of a wrong click.
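  • One way to sketch the weighted coding distance of sub-steps S34 to S38 is to lower the substitution cost when two keys are controlled by the same finger. The finger grouping below follows common touch-typing practice, and the weight coefficient 0.5 is an arbitrary assumption; the sketch groups keys by finger rather than measuring a physical distance, and `weighted_distance` is a hypothetical name.

```python
# Keys grouped by the finger that controls them in common touch typing.
FINGER_KEYS = ["qaz", "wsx", "edc", "rfvtgb", "yhnujm", "ik", "ol", "p"]
FINGER_OF = {key: idx for idx, keys in enumerate(FINGER_KEYS) for key in keys}

SAME_FINGER_WEIGHT = 0.5  # assumed coefficient; mis-hits on one finger are likelier


def substitution_cost(x: str, y: str) -> float:
    """Cost of typing key y instead of key x."""
    if x == y:
        return 0.0
    fx, fy = FINGER_OF.get(x), FINGER_OF.get(y)
    if fx is not None and fx == fy:
        return SAME_FINGER_WEIGHT  # same finger: reduced weight
    return 1.0


def weighted_distance(a: str, b: str) -> float:
    """Edit distance in which same-finger substitutions cost less."""
    a, b = a.lower(), b.lower()
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1.0,       # deletion
                            curr[j - 1] + 1.0,   # insertion
                            prev[j - 1] + substitution_cost(ca, cb)))
        prev = curr
    return prev[-1]
```

  • With these assumptions, replacing R by T (both on the left index finger) costs 0.5, while replacing R by E costs 1.0, so code strings that differ only by a same-finger slip fall below a given threshold more easily.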
  • Step 203 Perform a search by using the rewritten search keyword to obtain search result data that matches the rewritten search keyword.
  • the network information can be retrieved and matched by using full-text indexing, directory indexing, and the like.
  • Step 204 Generate a search result page according to the search result data.
  • the search engine searches in the database. If the network information matching the content requested by the user is found, the relevance and ranking level of each webpage are generally calculated according to the matching degree of the keywords in the network information, the location, the frequency, the link quality, and the like. Then, according to the degree of association, these network information links are returned to the user in order.
  • Step 205 Prompt, in the search result page, the information about the error correction of the search keyword.
  • the embodiment of the present invention may be prompted in any form.
  • For example, the information about the error correction of the search keyword may be prompted under the input box of the search engine, and an enhanced prompt may also be used, such as marking the text before the error correction and the error-corrected characters with different colors, which is not limited by the embodiment of the present invention.
  • the search keyword is subjected to error correction processing, and the search keyword is rewritten by using a near-word matching the search keyword to obtain search result data that matches the rewritten search keyword.
  • The rewritten search keyword brings the search results closer to what the user originally expected, which improves the user experience, reduces the waste of client and search engine resources, and improves search efficiency.
  • It also spares users from having to input keywords in the search engine again to obtain the webpage information they are interested in, so the search engine does not have to search, compare, and filter the massive information again to obtain information related to the search keywords.
  • the user operation is more convenient, the user's time consumption is reduced, and the resource consumption of the client and the search engine is further reduced.
  • Referring to FIG. 3, a flow chart of the steps of an embodiment of an instant search method according to an embodiment of the present invention is shown, which may include the following steps:
  • Step 301 Detect text information currently input in the search bar
  • ISE Instant Search Engine
  • RSS Simple Information Aggregation
  • Atom a pair of related standards
  • Tag categories tags
  • The instant search engine can detect the text information input by the user in the search bar. As the user inputs text information in the search bar, the instant search engine can give search results at the same time; as the user continues to input new text information, the search result page is refreshed accordingly.
  • Step 302 Perform error correction processing on the currently input text information.
  • search keywords may be error-corrected using Natural Language Processing (NLP).
  • the language information that is currently input may be error-corrected by using a language model.
  • the instant search engine can pre-acquire the user's input text information and then train the language model.
  • the trained model can be N-Gram (a language model commonly used in large vocabulary continuous speech recognition), a neural network-based language model, etc., and the learning of the user language model can be performed in a regular or client-side manner.
  • the above-mentioned error correction processing method is only an example.
  • other error correction processing methods may be set according to actual conditions, which is not limited by the embodiment of the present invention.
  • those skilled in the art may also adopt other error correction processing methods according to actual needs, which is not limited in the embodiment of the present invention.
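  • As one illustration of language-model-based error detection, the sketch below trains character bigram counts on a small corpus and flags positions in the input whose bigram was never seen. It is a deliberately simplified stand-in for the N-Gram model mentioned above; `train_bigrams`, `suspicious_positions`, and the toy corpus are assumptions introduced here.

```python
from collections import Counter


def train_bigrams(corpus) -> Counter:
    """Count adjacent character pairs over an iterable of strings."""
    counts = Counter()
    for sentence in corpus:
        counts.update(zip(sentence, sentence[1:]))
    return counts


def suspicious_positions(text: str, bigrams: Counter, min_count: int = 1) -> list:
    """Return the start index of every bigram seen fewer than min_count times."""
    return [i for i, pair in enumerate(zip(text, text[1:]))
            if bigrams[pair] < min_count]


# Toy corpus (assumed), standing in for previously collected user input.
toy_corpus = ["诸侯", "等候", "时候"]
toy_bigrams = train_bigrams(toy_corpus)
flagged = suspicious_positions("诸候", toy_bigrams)
```

  • A flagged position marks where a near-word or near-sound replacement could then be tried. A real model would use smoothed probabilities over a much larger corpus rather than raw counts.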
  • Step 303 providing real-time search result data based on the currently input text information feedback
  • The user does not need to trigger the query request by pressing the Enter key; a query request is automatically initiated to the instant search engine, and the search results are received and displayed.
  • Step 304 When an error is detected in the error correction processing on the text information, an approximate text matching the character data included in the text information found to be erroneous is calculated;
  • The approximate text can include a near-word (similar in shape) and/or a near-sound word (similar in pronunciation).
  • A near-sound word can be a character with the same or a similar pronunciation; for example, the pronunciation of both "case" and " ⁇ " is "an".
  • Chinese pinyin is composed of initials and finals, so the similarity between the initials and the similarity between the finals of the first character and the second character can be calculated separately to obtain the similarity between their pronunciations.
  • When the similarity is greater than the preset similarity threshold, it is determined that the first character and the second character are near-sound words.
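  • The initial/final comparison can be sketched as follows. The scoring scheme is an assumption made for illustration: identical parts score 1, a short list of commonly confused parts scores 0.5, everything else scores 0, and the two scores are averaged. The confusable pairs, the threshold 0.7, and all function names are hypothetical.

```python
# Two-letter initials come first so that "zh" is matched before "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g",
            "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

# Assumed sets of commonly confused initials and finals.
CONFUSABLE_INITIALS = {frozenset(p) for p in
                       [("z", "zh"), ("c", "ch"), ("s", "sh"), ("n", "l")]}
CONFUSABLE_FINALS = {frozenset(p) for p in
                     [("an", "ang"), ("en", "eng"), ("in", "ing")]}


def split_pinyin(syllable: str):
    """Split a toneless pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # zero-initial syllable such as "an"


def part_similarity(x: str, y: str, confusable: set) -> float:
    if x == y:
        return 1.0
    return 0.5 if frozenset((x, y)) in confusable else 0.0


def pinyin_similarity(p1: str, p2: str) -> float:
    """Average of the initial similarity and the final similarity."""
    i1, f1 = split_pinyin(p1)
    i2, f2 = split_pinyin(p2)
    return (part_similarity(i1, i2, CONFUSABLE_INITIALS)
            + part_similarity(f1, f2, CONFUSABLE_FINALS)) / 2


def is_near_sound(p1: str, p2: str, threshold: float = 0.7) -> bool:
    return pinyin_similarity(p1, p2) > threshold
```

  • Under these assumptions two characters both pronounced "an" have similarity 1.0, and "zhan" versus "zan" scores 0.75, which is above the assumed threshold.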
  • The font database is searched to find the near-word that most closely matches the text at the error.
  • the near-word can be obtained in the following manner:
  • Sub-step S41 determining the first text and the second text to be verified in the input search engine
  • the first text and the second text may be extracted from a preset collected corpus to perform verification of whether the characters are close to each other.
  • the first character and the second character may be Chinese characters.
  • Sub-step S42 acquiring a first encoded character string of the first character and a second encoded character string of the second character according to a preset rule
  • the preset rule may include a preset encoding rule
  • the sub-step S42 may further include the following sub-steps:
  • Sub-step S421 calculating a first encoded character string corresponding to the first character according to a preset encoding rule
  • Sub-step S422 calculating a second encoded character string corresponding to the second character according to the encoding rule
  • the preset encoding rule may include a five-stroke encoding rule.
  • Sub-step S43 calculating an encoding distance between the first encoded character string and the second encoded character string
  • the encoding distance may include an editing distance.
  • Sub-step S44 when the encoding distance is less than the preset distance threshold, determining that the first character and the second character are in close proximity to each other.
  • Sub-step S45 establishing a near-word mapping relationship between the first text and the second text in the search engine.
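Sub-steps S41 to S45 can be sketched as below, using the Levenshtein edit distance as the encoding distance. The code table `WUBI_CODES`, the threshold of 2, and the function names are hypothetical; the code strings are illustrative values that have not been checked against an authoritative Wubi table.

```python
def edit_distance(a, b):
    """Levenshtein distance between two encoded character strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical encoding table (illustrative values, not verified Wubi codes).
WUBI_CODES = {"项": "admy", "顶": "sdmy", "羽": "nnyg"}

def find_near_shape_pairs(chars, codes, threshold=2):
    """Build a symmetric near-shape mapping for all pairs under the threshold."""
    mapping = {}
    for i, first in enumerate(chars):
        for second in chars[i + 1:]:
            if edit_distance(codes[first], codes[second]) < threshold:
                mapping.setdefault(first, set()).add(second)
                mapping.setdefault(second, set()).add(first)
    return mapping

near_shape = find_near_shape_pairs(["项", "顶", "羽"], WUBI_CODES)
```

Under these assumed codes, "项" and "顶" differ in one code position and are mapped to each other, while "羽" is mapped to neither.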
  • alternatively, the near-shape character can be obtained in the following manner:
  • Sub-step S51 determining the first character and the second character to be verified that are input into the search engine
  • Sub-step S52 acquiring a first encoded character string of the first character and a second encoded character string of the second character according to a preset rule
  • Sub-step S53 calculating an encoding distance between the first encoded character string and the second encoded character string
  • Sub-step S54 searching for the first input key corresponding to the first encoded character string
  • Sub-step S55 searching for the second input key corresponding to the second encoded character string
  • Sub-step S56 calculating a key distance between the first input key and the second input key
  • Sub-step S57 configuring a weight for the encoding distance according to the key distance
  • Sub-step S58 when the encoding distance configured with the weight is less than the preset distance threshold, determining that the first character and the second character are near-shape characters of each other;
  • Sub-step S59 establishing a near-shape character mapping relationship between the first character and the second character in the search engine.
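A rough sketch of sub-steps S53 to S58 follows. The source only states that a weight is configured from the key distance and that the two may be inversely proportional; it does not say how the weight is applied. The sketch therefore assumes a plain QWERTY grid, compares keys position by position, sets the weight to the reciprocal of the average key distance, and divides the edit distance by that weight so that codes typed on neighbouring keys end up with a smaller weighted distance. All names and the threshold are hypothetical.

```python
import math

# Key coordinates (row, column) on a plain QWERTY grid; row stagger is ignored.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {key: (r, c) for r, keys in enumerate(ROWS) for c, key in enumerate(keys)}

def key_distance(k1, k2):
    """Euclidean distance between two keys on the grid."""
    (r1, c1), (r2, c2) = KEY_POS[k1], KEY_POS[k2]
    return math.hypot(r1 - r2, c1 - c2)

def edit_distance(a, b):
    """Levenshtein distance between two encoded character strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def weighted_coding_distance(code1, code2):
    """Edit distance adjusted by a weight derived from the key distance."""
    distance = edit_distance(code1, code2)
    # Key pairs at positions where the codes differ (position-wise pairing is an assumption).
    pairs = [(a, b) for a, b in zip(code1, code2) if a != b]
    if not pairs:
        return float(distance)
    avg_key_distance = sum(key_distance(a, b) for a, b in pairs) / len(pairs)
    weight = 1.0 / avg_key_distance  # weight inversely proportional to key distance
    return distance / weight         # assumed way of applying the weight

def is_near_shape_weighted(code1, code2, threshold=2.0):
    """True when the weighted encoding distance is below the preset threshold."""
    return weighted_coding_distance(code1, code2) < threshold
```

With these choices, two codes that differ by the adjacent keys "a" and "s" stay under the threshold, whereas codes that differ by the distant keys "a" and "p" do not.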
  • Step 305 inserting, into the instant search result data, prompt information recommending approximate characters for correcting the erroneous text information;
  • the embodiment of the present invention may present the prompt in any form. For example, the recommended corrected text may be shown under the input box, and the prompt may be enhanced by marking the text before correction and the recommended text in different colors; the embodiment of the present invention does not limit this.
  • Step 306 when receiving a trigger indication from the user on the prompt information, providing instant search result data obtained by searching with the approximate characters corresponding to the trigger indication.
  • the trigger indication may refer to an instruction sent by the user to replace the erroneous text information with the approximate characters. For example, a click by the user on the prompt information counts as receiving the trigger indication. As another example, selecting an approximate text with a key such as the Tab key and then pressing the Enter key also counts as receiving the trigger indication.
  • after the trigger indication is received, instant search result data fed back for the corrected text information may be provided again.
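The prompt-and-trigger flow of steps 304 to 306 can be sketched as two small functions. The `search` and `suggest` callables, the `InstantResponse` type, and the toy index and correction table are all hypothetical stand-ins; the source does not define such interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class InstantResponse:
    results: List[str]                # instant search results for the searched text
    suggestion: Optional[str] = None  # recommended corrected text, if an error was found

def instant_search(text: str,
                   search: Callable[[str], List[str]],
                   suggest: Callable[[str], Optional[str]]) -> InstantResponse:
    """Return results for the typed text plus a correction prompt when one exists."""
    results = search(text)
    suggestion = suggest(text)
    if suggestion is not None and suggestion != text:
        return InstantResponse(results, suggestion)
    return InstantResponse(results)

def on_prompt_triggered(response: InstantResponse,
                        search: Callable[[str], List[str]]) -> InstantResponse:
    """Re-run the search with the suggested text once the user accepts the prompt."""
    if response.suggestion is None:
        return response
    return InstantResponse(search(response.suggestion))

# Toy data for illustration only.
TOY_INDEX = {"项羽": ["Xiang Yu biography"]}
TOY_CORRECTIONS = {"顶羽": "项羽"}

def toy_search(query: str) -> List[str]:
    return TOY_INDEX.get(query, [])

first = instant_search("顶羽", toy_search, TOY_CORRECTIONS.get)
second = on_prompt_triggered(first, toy_search)
```

Here `first` carries the empty result list for the mistyped query together with the suggestion, and `second` carries the results for the suggested query after the user triggers the prompt.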
  • in the embodiment of the present invention, error correction processing is performed on the text information in the instant search engine, and the search keyword is rewritten with approximate characters matching the text information, so as to obtain search result data that matches the rewritten text information.
  • on the one hand, the rewritten search keyword brings the search results closer to what the user originally expected, which improves the user experience, reduces wasted client and search-engine resources, and improves search efficiency.
  • on the other hand, the user no longer has to enter keywords into the search engine again to obtain the webpage information of interest, so the search engine does not have to search, compare, and filter massive amounts of information a second time to obtain information related to the search keyword.
  • this makes the user's operation more convenient, reduces the time the user spends, and further reduces the resource consumption of the client and the search engine.
  • Referring to FIG. 4, a structural block diagram of an embodiment of an apparatus for determining near-shape characters in a search engine according to an embodiment of the present invention is shown, which may include the following modules:
  • the character determining module 401 is adapted to determine the first character and the second character to be verified that are input into the search engine
  • the encoding obtaining module 402 is adapted to acquire the first encoded character string of the first character and the second encoded character string of the second character according to a preset rule;
  • the encoding distance calculation module 403 is adapted to calculate an encoding distance between the first encoded character string and the second encoded character string;
  • the near-shape character determining module 404 is adapted to determine that the first character and the second character are near-shape characters of each other when the encoding distance is less than a preset distance threshold;
  • the mapping relationship determining module 405 is adapted to establish a near-shape character mapping relationship between the first character and the second character in the search engine.
  • the preset rule may include a preset encoding rule
  • the encoding obtaining module may further be adapted to:
  • the preset encoding rule includes the five-stroke (Wubi) encoding rule.
  • an output module adapted to output the first character and the second character that are near-shape characters of each other, together with the near-shape character mapping relationship, to a specified font database.
  • Referring to FIG. 5, a structural block diagram of an apparatus for providing keyword error correction in a search according to an embodiment of the present invention is shown, which may include the following units:
  • the receiving unit 501 is adapted to receive a search request; the search request includes a search keyword;
  • the rewriting unit 502 is adapted to rewrite the search keyword with near-shape characters matching the search keyword when error correction processing on the search keyword finds an error;
  • the searching unit 503 is adapted to perform a search with the rewritten search keyword to obtain search result data that matches the rewritten search keyword.
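One way the rewriting unit could use a near-shape mapping is sketched below. The source does not describe how an error is detected or how a replacement is chosen, so this sketch assumes a keyword counts as erroneous when it is absent from a known vocabulary, and it accepts the first candidate rewrite that is present. The mapping, the vocabulary, and the function names are hypothetical.

```python
from itertools import product

def candidate_rewrites(keyword, near_shape):
    """Yield every string obtained by swapping characters for their near-shape characters."""
    options = [[ch] + sorted(near_shape.get(ch, ())) for ch in keyword]
    for combo in product(*options):
        candidate = "".join(combo)
        if candidate != keyword:
            yield candidate

def rewrite_keyword(keyword, near_shape, vocabulary):
    """Return a corrected keyword, or the original when no correction applies."""
    if keyword in vocabulary:
        return keyword  # assumed error check: known keywords are left alone
    for candidate in candidate_rewrites(keyword, near_shape):
        if candidate in vocabulary:
            return candidate
    return keyword

# Toy data for illustration only.
NEAR_SHAPE = {"顶": {"项"}, "项": {"顶"}}
VOCABULARY = {"项羽"}

rewritten = rewrite_keyword("顶羽", NEAR_SHAPE, VOCABULARY)
```

The number of candidates grows multiplicatively with keyword length, so a practical system would presumably prune or rank candidates; that part is outside this sketch.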
  • the near-shape character can be obtained by calling the following modules:
  • a character determining module adapted to determine the first character and the second character to be verified that are input into the search engine
  • an encoding obtaining module adapted to acquire a first encoded character string of the first character and a second encoded character string of the second character according to a preset rule
  • an encoding distance calculation module adapted to calculate an encoding distance between the first encoded character string and the second encoded character string
  • a near-shape character determining module adapted to determine that the first character and the second character are near-shape characters of each other when the encoding distance is less than a preset distance threshold;
  • a mapping relationship determining module adapted to establish a near-shape character mapping relationship between the first character and the second character in the search engine.
  • the preset rule may include a preset encoding rule
  • the encoding obtaining module is further adapted to:
  • the preset encoding rule includes the five-stroke (Wubi) encoding rule.
  • the near-shape character can also be obtained by calling the following modules:
  • a first search module adapted to search for the first input key corresponding to the first encoded character string
  • a second search module adapted to search for the second input key corresponding to the second encoded character string
  • a key distance calculation module adapted to calculate a key distance between the first input key and the second input key
  • a weight configuration module adapted to configure a weight for the encoding distance according to the key distance
  • the near-shape character determining module may also be adapted to:
  • the key distance may be inversely proportional to the weight.
  • a generating unit adapted to generate a search result page based on the search result data.
  • a prompting unit adapted to present, in the search result page, prompt information for correcting the search keyword.
  • Referring to FIG. 6, a structural block diagram of an embodiment of an instant search system according to an embodiment of the present invention is shown, which may include the following units:
  • the text information detecting unit 601 is adapted to detect the text information currently input in the search bar;
  • the error correction processing unit 602 is adapted to perform error correction processing on the currently input text information
  • the first result providing unit 603 is adapted to provide instant search result data fed back for the currently input text information;
  • the approximate character calculation unit 604 is adapted to calculate, when error correction processing on the text information finds an error, approximate characters that match the character data contained in the erroneous text information;
  • the error correction prompting unit 605 is adapted to insert, into the instant search result data, prompt information recommending approximate characters for correcting the erroneous text information;
  • the second result providing unit 606 is adapted to provide, when a trigger indication from the user on the prompt information is received, instant search result data obtained by searching with the approximate characters corresponding to the trigger indication.
  • the approximate characters may comprise near-sound characters and/or near-shape characters.
  • the near-shape character can be obtained by calling the following modules:
  • a character determining module adapted to determine the first character and the second character to be verified that are input into the search engine
  • an encoding obtaining module adapted to acquire a first encoded character string of the first character and a second encoded character string of the second character according to a preset rule
  • an encoding distance calculation module adapted to calculate an encoding distance between the first encoded character string and the second encoded character string
  • a near-shape character determining module adapted to determine that the first character and the second character are near-shape characters of each other when the encoding distance is less than a preset distance threshold;
  • a mapping relationship determining module adapted to establish a near-shape character mapping relationship between the first character and the second character in the search engine.
  • the preset rule may include a preset encoding rule
  • the encoding obtaining module may further be adapted to:
  • the preset encoding rule includes the five-stroke (Wubi) encoding rule.
  • the near-shape character can also be obtained by calling the following modules:
  • a first search module adapted to search for the first input key corresponding to the first encoded character string
  • a second search module adapted to search for the second input key corresponding to the second encoded character string
  • a key distance calculation module adapted to calculate a key distance between the first input key and the second input key
  • a weight configuration module adapted to configure a weight for the encoding distance according to the key distance
  • the near-shape character determining module may also be adapted to:
  • the key distance may be inversely proportional to the weight.
  • the description is relatively simple, and the relevant parts can be referred to the description of the method embodiment.
  • the various component embodiments of the present invention may be implemented in hardware, or in a software module running on one or more processors, or in a combination thereof.
  • a microprocessor or digital signal processor may be used in practice to implement some or all of the functionality of some or all of the components for determining near-shape characters in a search engine, providing keyword error correction in a search, and/or instant search in accordance with an embodiment of the present invention.
  • the invention can also be implemented as a device or device program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein.
  • Such a program implementing the invention may be stored on a computer readable medium or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
  • FIG. 7 illustrates a computing device, such as an application server, that can implement determining near-shape characters in a search engine, providing keyword error correction in a search, and instant search in accordance with the present invention.
  • the computing device conventionally includes a processor 710 and a computer program product or computer readable medium in the form of a memory 720.
  • Memory 720 can be an electronic memory such as a flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), EPROM, hard disk, or ROM.
  • Memory 720 has a memory space 730 for program code 731 for performing any of the method steps described above.
  • storage space 730 for program code may include various program code 731 for implementing various steps in the above methods, respectively.
  • the program code can be read from or written to one or more computer program products.
  • These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks.
  • Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG.
  • the storage unit may have storage segments, storage spaces, and the like that are similarly arranged to memory 720 in the computing device of FIG.
  • the program code can be compressed, for example, in an appropriate form.
  • the storage unit includes computer readable code 731', i.e., code readable by a processor such as 710, which, when executed by a computing device, causes the computing device to perform each step of the methods described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a method and apparatus for determining similar characters in a search engine. The method comprises: determining a first character and a second character that are input into a search engine and are to be checked; acquiring a first code character string of the first character and a second code character string of the second character according to a preset rule; calculating a code distance between the first code character string and the second code character string; when the code distance is less than a preset distance threshold, determining that the first character and the second character are similar characters; and establishing a similar character mapping between the first character and the second character in the search engine. The embodiments make it possible to determine whether a first character and a second character are similar characters, thereby improving the webpage recognition efficiency of a search engine and providing the search engine with more functions.

Description

Method and apparatus for determining near-shape characters in a search engine
Technical field
The invention relates to the technical field of language and text information, and in particular to a method for determining near-shape characters in a search engine, a method for providing Chinese keyword error correction in a search, an instant search method, an apparatus for determining near-shape characters in a search engine, an apparatus for providing Chinese keyword error correction in a search, and an instant search system.
Background
With the rapid development of the Internet, network applications have become increasingly diverse, and the amount of information on the Internet has increased dramatically.
In many situations, users need to input text in order to exchange information: for example, entering keywords in a search engine to search for webpage information, or entering words and sentences in an instant messaging tool to communicate with other users.
Written languages contain near-shape characters, that is, characters whose structures are similar. Characters are input through various encoding schemes, such as five-stroke (Wubi) encoding and pinyin encoding. When a user inputs text with such a scheme, near-shape characters easily lead to mistakes in which a different character is entered, so the user often has to re-enter the text, which is both troublesome and a waste of system resources.
Taking Wubi as an example, whether the input is accurate depends on how careful the user is and how well the user knows the Chinese characters themselves. Wrong characters caused by carelessness, or by the user's own mistaken belief about how a character is written, are not uncommon. For example, a front-page headline of a newspaper that should have read "乱揿喇叭被罚不要喊冤" was printed as "乱揿嗽叭被罚不要喊冤", with "喇" replaced by the similar-looking "嗽".
Furthermore, suppose a user wants to enter the search term "项羽" (Xiang Yu) in a search engine to find webpage information about the historical figure Xiang Yu, but mistypes "项" as "顶". Because "项" and "顶" look very similar, the user may well enter "顶羽" without noticing and directly request the search engine to search for webpage information related to "顶羽".
On the one hand, the search results of such a mistaken operation differ greatly from what the user expected, the user experience is very poor, and client and search-engine resources are wasted. On the other hand, to obtain the webpage information of interest, the user has to enter keywords into the search engine again, and the search engine has to search, compare, and filter massive amounts of information a second time to obtain information related to the search keyword. This not only makes the user's operation more cumbersome and time-consuming, but also greatly increases the burden on the search engine and consumes more client and search-engine resources.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method for determining near-shape characters in a search engine, a method for providing Chinese keyword error correction in a search, an instant search method, and a corresponding apparatus for determining near-shape characters in a search engine, apparatus for providing Chinese keyword error correction in a search, and instant search system, which overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for determining near-shape characters in a search engine is provided, comprising:
determining a first character and a second character to be verified that are input into the search engine;
acquiring a first encoded character string of the first character and a second encoded character string of the second character according to a preset rule;
calculating an encoding distance between the first encoded character string and the second encoded character string;
when the encoding distance is less than a preset distance threshold, determining that the first character and the second character are near-shape characters of each other;
establishing a near-shape character mapping relationship between the first character and the second character in the search engine.
According to another aspect of the present invention, a method for providing keyword error correction in a search is provided, comprising:
receiving a search request, the search request including a search keyword;
when error correction processing on the search keyword finds an error, rewriting the search keyword with near-shape characters matching the search keyword;
searching with the rewritten search keyword to obtain search result data that matches the rewritten search keyword.
According to another aspect of the present invention, an instant search method is provided, comprising:
detecting the text information currently input in the search bar, performing error correction processing on the currently input text information, and providing instant search result data fed back for the currently input text information;
when error correction processing on the text information finds an error, calculating approximate characters that match the character data contained in the erroneous text information;
inserting, into the instant search result data, prompt information recommending approximate characters for correcting the erroneous text information;
when a trigger indication from the user on the prompt information is received, providing instant search result data obtained by searching with the approximate characters corresponding to the trigger indication.
According to another aspect of the present invention, an apparatus for determining near-shape characters in a search engine is provided, comprising:
a character determining module, adapted to determine a first character and a second character to be verified that are input into the search engine;
an encoding obtaining module, adapted to acquire a first encoded character string of the first character and a second encoded character string of the second character according to a preset rule;
an encoding distance calculation module, adapted to calculate an encoding distance between the first encoded character string and the second encoded character string;
a near-shape character determining module, adapted to determine that the first character and the second character are near-shape characters of each other when the encoding distance is less than a preset distance threshold;
a mapping relationship determining module, adapted to establish a near-shape character mapping relationship between the first character and the second character in the search engine.
According to another aspect of the present invention, an apparatus for providing keyword error correction in a search is provided, comprising:
a receiving unit, adapted to receive a search request, the search request including a search keyword;
a rewriting unit, adapted to rewrite the search keyword with near-shape characters matching the search keyword when error correction processing on the search keyword finds an error;
a searching unit, adapted to search with the rewritten search keyword to obtain search result data that matches the rewritten search keyword.
According to another aspect of the present invention, an instant search system is provided, comprising:
a text information detecting unit, adapted to detect the text information currently input in the search bar;
an error correction processing unit, adapted to perform error correction processing on the currently input text information;
a first result providing unit, adapted to provide instant search result data fed back for the currently input text information;
an approximate character calculation unit, adapted to calculate, when error correction processing on the text information finds an error, approximate characters that match the character data contained in the erroneous text information;
an error correction prompting unit, adapted to insert, into the instant search result data, prompt information recommending approximate characters for correcting the erroneous text information;
a second result providing unit, adapted to provide, when a trigger indication from the user on the prompt information is received, instant search result data obtained by searching with the approximate characters corresponding to the trigger indication.
According to yet another aspect of the present invention, a computer program is provided, comprising computer readable code which, when run on a computing device, causes the computing device to perform the above method for determining near-shape characters in a search engine, method for providing keyword error correction in a search, and instant search method.
According to still another aspect of the present invention, a computer readable medium is provided, in which the above computer program is stored.
The beneficial effects of the invention are as follows:
By calculating, in the search engine, the encoding distance between the first encoded character string of the first character and the second encoded character string of the second character, the embodiments of the present invention determine whether the first character and the second character are near-shape characters of each other, which improves the webpage recognition efficiency of the search engine and adds to its functions.
The embodiments of the present invention perform error correction processing on the search keyword and rewrite it with near-shape characters matching the search keyword, so as to obtain search result data that matches the rewritten search keyword. On the one hand, the rewritten search keyword brings the search results closer to what the user originally expected, which improves the user experience, reduces wasted client and search-engine resources, and improves search efficiency. On the other hand, the user no longer has to enter keywords into the search engine again to obtain the webpage information of interest, so the search engine does not have to search, compare, and filter massive amounts of information a second time; this makes the user's operation more convenient, reduces the time the user spends, and further reduces the resource consumption of the client and the search engine.
The embodiments of the present invention perform error correction processing on the text information in an instant search engine and rewrite the search keyword with approximate characters matching the text information, so as to obtain search result data that matches the rewritten text information. On the one hand, the rewritten search keyword brings the search results closer to what the user originally expected, which improves the user experience, reduces wasted client and search-engine resources, and improves search efficiency. On the other hand, the user no longer has to enter keywords into the search engine again to obtain the webpage information of interest, so the search engine does not have to search, compare, and filter massive amounts of information a second time; this makes the user's operation more convenient, reduces the time the user spends, and further reduces the resource consumption of the client and the search engine.
The above description is only an overview of the technical solutions of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the contents of the specification, and so that the above and other objects, features, and advantages of the present invention become more apparent, specific embodiments of the present invention are set forth below.
Brief description of the drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered as limiting the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
FIG. 1 schematically shows a flow chart of the steps of an embodiment of a method for determining near-shape characters in a search engine according to an embodiment of the present invention;
FIG. 2 schematically shows a flow chart of the steps of an embodiment of a method for providing keyword error correction in a search according to an embodiment of the present invention;
FIG. 3 schematically shows a flow chart of the steps of an embodiment of an instant search method according to an embodiment of the present invention;
FIG. 4 schematically shows a structural block diagram of an embodiment of an apparatus for determining near-shape characters in a search engine according to an embodiment of the present invention;
FIG. 5 schematically shows a structural block diagram of an embodiment of an apparatus for providing keyword error correction in a search according to an embodiment of the present invention;
FIG. 6 schematically shows a structural block diagram of an embodiment of an instant search system according to an embodiment of the present invention;
FIG. 7 schematically shows a block diagram of a computing device for performing the method according to the present invention; and
FIG. 8 schematically shows a storage unit for holding or carrying program code that implements the method according to the present invention.
Detailed Description
The invention is further described below with reference to the drawings and specific embodiments.
Referring to FIG. 1, a flow chart of the steps of an embodiment of a method for determining visually similar characters in a search engine according to one embodiment of the present invention is shown. The method may include the following steps:
Step 101: determine a first character and a second character, input into the search engine, that are to be verified;
The processing flow of a search engine can generally be divided into two parts: the first is front-end handling of user requests, and the second is back-end data production.
I. Front-end handling of a user request may include:
1. The user enters a keyword;
2. Query analysis: the search engine segments the keyword into words;
3. Retrieval: based on the segmentation result, the set of relevant web pages is found in a previously built index;
4. Ranking: the candidate web pages are ranked along dimensions such as content relevance and timeliness;
5. Presentation: the ranked web pages are displayed.
II. Back-end data production may include:
1. Web crawling: a crawler follows the links between web pages to fetch pages from the Internet and save them;
2. Index building: the saved pages are analyzed, the page titles and page text are segmented into words, and an inverted index is built from the segmentation results for use by front-end retrieval.
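The index-building and retrieval steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the page dictionary layout, and the whitespace tokenizer (a stand-in for real word segmentation) are all assumptions.

```python
from collections import defaultdict


def tokenize(text):
    """Stand-in for word segmentation: lowercase and split on whitespace (an assumption)."""
    return text.lower().split()


def build_inverted_index(pages):
    """Map each token to the sorted list of page ids whose title or text contains it."""
    index = defaultdict(set)
    for page_id, page in pages.items():
        for token in tokenize(page["title"] + " " + page["text"]):
            index[token].add(page_id)
    return {token: sorted(ids) for token, ids in index.items()}


def retrieve(index, query):
    """Return the ids of pages that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    result = set(index.get(tokens[0], []))
    for token in tokens[1:]:
        result &= set(index.get(token, []))
    return sorted(result)


# Hypothetical saved pages, keyed by page id.
pages = {
    1: {"title": "search engine basics", "text": "a crawler fetches pages"},
    2: {"title": "index building", "text": "an inverted index maps words to pages"},
}
index = build_inverted_index(pages)
print(retrieve(index, "inverted index"))
```

Ranking and presentation are left out; the sketch only shows how a token-to-pages mapping built on the back end lets the front end find candidate pages without scanning every page.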
The pages fetched by the crawler can be stored in a web page database. Because these pages contain a large amount of text, the web page database can also be called a corpus.
In a specific implementation, the first character and the second character can be extracted from the corpus and checked to determine whether they are visually similar to each other.
In an optional example of this embodiment, the first character and the second character may be Chinese characters.
Step 102: obtain a first code string for the first character and a second code string for the second character according to a preset rule;
Characters can have specific structural properties. Encoding characters according to these properties and building an input method on that encoding makes it possible to enter characters on an electronic device. For example, the first character and the second character can be entered with a pinyin input method, a Wubi input method, a stroke input method, and so on.
Correspondingly, under different encoding rules the first character and the second character may correspond to different first and second code strings. For example, the code string for "侧" is "ce" under the pinyin input method and "WMJh" under the Wubi input method.
In a preferred embodiment of the present invention, the preset rule may include a preset encoding rule, and step 102 may include the following sub-steps:
Sub-step S11: compute the first code string corresponding to the first character according to the preset encoding rule;
Sub-step S12: compute the second code string corresponding to the second character according to the encoding rule;
The preset encoding rule may include the Wubi encoding rule.
Chinese characters are composed of strokes or radicals. To input them, characters can be decomposed into a set of commonly used basic units called roots (字根). A root can be a radical of a Chinese character, part of a radical, or even a single stroke.
When roots combine to form a character, the positional relationships between them fall into four structures: single (单), separated (散), connected (连), and crossed (交). Single means that the root is itself a character, including key-name roots and character-forming roots, for example 口 and 木. Separated means that the roots making up the character keep some distance from one another, for example 汉 and 湘. Connected means that a root is joined to a single stroke, for example "丿" joined to "目" forms "自". Crossed means that several roots overlap to form the character, for example "申" is formed by "日" crossed with "丨".
Wubi, short for the Wubi (five-stroke) input method, is a shape-based input method. Roots are its basic units: Chinese characters are encoded according to their strokes and shape features, the roots are classified according to certain rules, and the roots are then assigned to keys on the keyboard to serve as the basic units for entering characters.
Specifically, Wubi divides the strokes of Chinese characters into five zones: horizontal (横, including rising strokes), vertical (竖), left-falling (撇), right-falling (捺, including dots), and turning (折). The roots, or code elements, are distributed according to certain rules over 25 letter keys (that is, a standard QWERTY keyboard excluding the Z key).
When entering a Chinese character with the Wubi input method, the user presses the keys corresponding to the character's roots in order, following the character's writing order and structure, to form a code string. The system then uses this code string to look up the desired character in the Wubi character library.
It should be noted that in the Wubi input method, although the use of identification codes keeps the rate of duplicate codes (identical code strings) low for single characters, the duplicate-code rate for phrases is relatively high. For this reason, Wubi input methods generally do not use large phrase dictionaries, so as to avoid excessive duplicate codes. Conversely, Wubi is particularly well suited to single-character input, where it achieves high input efficiency.
Step 103: compute the encoding distance between the first code string and the second code string;
The encoding distance between the first code string and the second code string indicates how similar the two code strings are.
In a preferred example of this embodiment, the encoding distance may include the edit distance. The edit distance, also called the Levenshtein distance, is the minimum number of editing operations needed to transform one of two strings (for example, the first code string and the second code string) into the other.
In practice, the permitted editing operations are replacing one character with another, inserting a character, and deleting a character.
For example, transforming the string "kitten" into the string "sitting" requires a minimum of three operations:
1. sitten (k→s): replace the character "k" with the character "s";
2. sittin (e→i): replace the character "e" with the character "i";
3. sitting (→g): insert the character "g" at the end of the string "sittin".
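The edit distance described above can be computed with the standard dynamic-programming recurrence. The following is a minimal sketch; the function name `levenshtein` is an illustrative choice and not taken from the patent.

```python
def levenshtein(a, b):
    """Minimum number of single-character substitutions, insertions,
    and deletions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute ch_a with ch_b, or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.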
Step 104: when the encoding distance is less than a preset distance threshold, determine that the first character and the second character are visually similar to each other.
Visually similar characters are characters with similar glyph structures, which are easily confused in use. For example, "己", "已", and "巳" are visually similar to one another.
In the Wubi input method, roots or code elements generally occur in groups: roots that are the same as or similar to the strokes or radicals that make up characters are concentrated on one key or on adjacent keys. For example, in one version of Wubi, the roots assigned to the H key include "目, 上, 卜, 止, 虎, 头, 具".
Because visually similar characters have similar glyph structures, the roots that compose them are correspondingly similar.
When entering a single character with Wubi, apart from a small number of key-name roots and character-forming roots, in most cases the character must be decomposed into roots by applying decomposition rules suited to the features of Chinese characters. If the decomposition yields more than four roots, the character is entered using the first, second, third, and last roots.
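The rule for characters that decompose into more than four roots can be sketched as below. The function name and the placeholder root labels are hypothetical, and the sketch does not model how the roots themselves are obtained or how identification codes are added for short decompositions.

```python
def select_roots(roots):
    """Keep at most four roots: the first, second, third, and last."""
    if len(roots) <= 4:
        return list(roots)
    return list(roots[:3]) + [roots[-1]]


# Placeholder labels standing in for the roots of a decomposed character.
print(select_roots(["r1", "r2", "r3", "r4", "r5", "r6"]))
```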
For example, the decomposition rules may include: follow the writing order, prefer the largest root, favor intuitive decompositions, prefer connected over crossed, and prefer separated over connected.
The strokes or radicals that make up characters follow certain usage rules, which may include positional rules, writing rules, and so on. For example, the single-person radical "亻" and the double-person radical "彳" generally appear on the far left of a character and are written first, as in "你", "亿", "很", and "往".
These usage rules allow Chinese characters to be divided into single-component characters (characters formed directly from strokes, or equivalently from a single component, such as 上, 下, 日, 月) and compound characters (characters composed of several components, such as 挂, 休, 取, 明).
Specifically, Chinese character structures can be divided into:
(1) Top-bottom structure: 思, 歪, 冒, 意, 安, 全;
(2) Top-middle-bottom structure: 草, 暴, 意, 竟, 竞;
(3) Left-right structure: 好, 棚, 和, 蜂, 滩, 往, 明;
(4) Left-middle-right structure: 谢, 树, 倒, 搬, 撇, 鞭, 辩;
(5) Fully enclosed structure: 围, 囚, 困, 田, 因, 国, 固;
(6) Semi-enclosed structure: 包, 区, 闪, 这, 句, 函, 风;
(7) Interleaved structure: 噩, 兆, 非;
(8) 品-shaped structure: 品, 森, 聂, 晶, 磊, 鑫, 焱.
Therefore, in the Wubi input method, because the strokes and radicals of Chinese characters resemble Wubi roots, and because the structure and writing rules of Chinese characters resemble the Wubi decomposition rules, decomposing visually similar characters into roots yields similar code strings. For example, "测" and "侧" are visually similar. "测" consists of three components that are also roots, "氵", "贝", and "刂", and its code string is "imjh". "侧" consists of three components that are also roots, "亻", "贝", and "刂", and its code string is "wmjh". Clearly, "imjh" and "wmjh" are very similar.
Correspondingly, the encoding distance is computed between the first code string and the second code string corresponding to the first and second characters. When the distance is less than the preset distance threshold, the similarity is high and the characters can be considered visually similar. Conversely, when the encoding distance is greater than or equal to the preset distance threshold, the similarity is low and the characters can be considered not visually similar.
For example, in the Wubi input method a Chinese character has a code string of at most 4 characters, so the distance threshold can be preset to 2. For the characters "候" and "侯", applying the Wubi encoding rule gives the code string "whnd" for "候" and "wntd" for "侯". The encoding distance between "whnd" and "wntd" is 1, which is less than the distance threshold of 2, so "候" and "侯" can be determined to be visually similar.
Step 105: establish, in the search engine, a visually-similar-character mapping between the first character and the second character.
In a specific implementation, a font database can be established in the search engine for each character, collecting the characters visually similar to the current character and the corresponding mappings.
It should be noted that the mapping can be mutual. For example, the mapping from the first character to the second character may be "first character → second character", and the mapping from the second character to the first character may be "second character → first character".
By computing, in the search engine, the encoding distance between the first code string of the first character and the second code string of the second character, this embodiment determines whether the first character and the second character are visually similar to each other, which improves the search engine's efficiency in recognizing web pages and extends the search engine's functionality.
In a preferred embodiment of the present invention, the method may further include the following step:
Step 106: output the mutually visually similar first and second characters, together with the mapping between them, to a designated font database.
With this embodiment, all characters in the corpus can be traversed to find the characters visually similar to the current character, and the characters found, together with their mappings, are used to generate a font database for the current character.
For example, the font database of the first character stores one or more visually similar characters and their mappings, such as "first character → second character, third character, fourth character"; the font database of the second character stores one or more visually similar characters and their mappings, such as "second character → first character, fifth character, sixth character".
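Steps 103 to 106 can be sketched together as a pairwise pass over a character-to-code table that records a mutual mapping whenever the edit distance falls below the threshold. This is only an illustration under stated assumptions: the function names and the dictionary representation of the font database are hypothetical, the codes for "测" and "侧" are the ones quoted in the text, and the code "vbg" for "好" is an assumed value added as a dissimilar entry.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance with unit-cost substitution, insertion, and deletion."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def build_similar_map(codes, threshold=2):
    """Return {character: set of visually similar characters}.

    Two characters are linked, in both directions, when the edit distance
    between their code strings is strictly less than the threshold.
    """
    similar = {ch: set() for ch in codes}
    for first, second in combinations(codes, 2):
        if levenshtein(codes[first], codes[second]) < threshold:
            similar[first].add(second)
            similar[second].add(first)
    return similar


# "imjh" and "wmjh" come from the text; "vbg" is an assumed code.
codes = {"测": "imjh", "侧": "wmjh", "好": "vbg"}
similar_map = build_similar_map(codes)
print(similar_map["测"], similar_map["侧"], similar_map["好"])
```

Comparing every pair is quadratic in the number of characters, which is acceptable for a one-off offline pass over a character set but would need pruning for much larger vocabularies.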
Referring to FIG. 2, a flow chart of the steps of an embodiment of a method for providing keyword error correction in search according to one embodiment of the present invention is shown. The method may include the following steps:
Step 201: receive a search request, the search request including a search keyword;
A search request is an instruction issued by a user to search with a certain search keyword. For example, the user may issue a search request through a search engine web page, through a search plug-in, and so on. When the user enters a search keyword in the search engine's search box and clicks or presses the Enter key, a search request is received; likewise, when the user enters a search keyword in the input box of a search plug-in and clicks or presses the Enter key, a search request is received.
Step 202: when error-correction processing of the search keyword finds an error, rewrite the search keyword using a visually similar character that matches the search keyword;
In a specific implementation, natural language processing (NLP) techniques can be used to perform error correction on the search keyword.
Error correction can generally be split into two subtasks:
1. Spelling error detection: errors can be divided by type into non-word errors and real-word errors. A non-word error is one where the misspelled word is itself not a valid word, for example mistakenly writing "giraffe" as "graffe". A real-word error is one where the misspelled word is still a valid word, for example misspelling "there" as "three" (similar in form), "peace" as "piece" (homophone), or "two" as "too" (homophone). In a specific implementation, spelling correction can be based on, for example, the noisy channel model;
2. Spelling error correction: to correct the search keyword, word-level error checking can be performed, for example checking for errors between adjacent characters, between adjacent characters and words, and between adjacent words. The visually-similar-character mappings are then used to look up, in the font database, the visually similar character that best matches the erroneous character, and the search keyword is rewritten with it.
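The rewrite step can be sketched as follows, assuming a visually-similar-character map is already available. Error detection here is a plain lookup in a set of known terms, which is a deliberate simplification standing in for the NLP techniques mentioned above; the function name, the map, and the vocabulary are all hypothetical.

```python
def correct_query(query, similar_map, vocabulary):
    """Rewrite the query by swapping one character for a visually similar one.

    The query counts as erroneous when it is not in the vocabulary. The first
    single-character substitution that yields a known term is returned;
    otherwise the query is returned unchanged.
    """
    if query in vocabulary:
        return query
    for i, ch in enumerate(query):
        for candidate in sorted(similar_map.get(ch, ())):
            rewritten = query[:i] + candidate + query[i + 1:]
            if rewritten in vocabulary:
                return rewritten
    return query


# Hypothetical data: a mutual mapping between two similar characters and a tiny term list.
similar_map = {"测": {"侧"}, "侧": {"测"}}
vocabulary = {"测试", "侧面"}

print(correct_query("侧试", similar_map, vocabulary))
```

A production system would rank several candidate rewrites instead of taking the first hit, for instance by how often each candidate occurs in the corpus.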
In a preferred embodiment of the present invention, the visually similar character may be obtained as follows:
Sub-step S21: determine a first character and a second character, input into the search engine, that are to be checked for visual similarity;
Sub-step S22: obtain a first code string for the first character and a second code string for the second character according to a preset rule;
In a preferred example of this embodiment, the preset rule may include a preset encoding rule, and sub-step S22 may further include the following sub-steps:
Sub-step S221: compute the first code string corresponding to the first character according to the preset encoding rule;
Sub-step S222: compute the second code string corresponding to the second character according to the encoding rule;
The preset encoding rule may include the Wubi encoding rule.
Sub-step S23: compute the encoding distance between the first code string and the second code string;
Sub-step S24: when the encoding distance is less than a preset distance threshold, determine that the first character and the second character are visually similar to each other;
Sub-step S25: establish, in the search engine, a visually-similar-character mapping between the first character and the second character.
It should be noted that, because sub-steps S21 to S25 are substantially similar to method embodiment 1, they are described only briefly here; for the relevant details, refer to the description of method embodiment 1.
In a preferred embodiment of the present invention, the visually similar character may be obtained as follows:
Sub-step S31: determine a first character and a second character, input into the search engine, that are to be checked for visual similarity;
Sub-step S32: obtain a first code string for the first character and a second code string for the second character according to a preset rule;
Sub-step S33: compute the encoding distance between the first code string and the second code string;
Sub-step S34: look up the first input keys corresponding to the first code string;
Sub-step S35: look up the second input keys corresponding to the second code string;
Sub-step S36: compute the key distances between the first input keys and the second input keys;
Sub-step S37: assign a corresponding weight to the encoding distance according to the key distance;
Sub-step S38: when the weighted encoding distance is less than the preset distance threshold, determine that the first character and the second character are visually similar to each other;
Sub-step S39: establish, in the search engine, a visually-similar-character mapping between the first character and the second character.
In this embodiment, the key distance between a first input key and a second input key may be the physical distance between the keys on the keyboard.
In standard QWERTY touch typing, the left index finger controls the keys R, T, F, G, V, B; the left middle finger controls E, D, C; the left ring finger controls W, S, X; the left little finger controls Q, A, Z; the right index finger controls Y, U, H, J, N, M; the right middle finger controls I, K; the right ring finger controls O, L; and the right little finger controls P. The F and J keys usually have raised bumps and serve as home keys.
Because of the home keys, when a finger presses a key outside its own set, for example the left index finger pressing E, the finger must stretch further, which is noticeably uncomfortable for the user, so this kind of mistaken keypress is unlikely. Conversely, a mistaken keypress among the keys controlled by the same finger is relatively likely; for example, when the left index finger presses R, it can easily hit T by mistake.
Therefore, the key distance may be inversely proportional to the weight. In addition, optionally, for input keys controlled by the same finger, a weight coefficient can be applied to lower the weight, making the encoding distance between the first and second characters smaller, that is, their similarity higher, to reflect the relatively high likelihood of a mistaken keypress.
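One possible reading of the optional same-finger adjustment is sketched below: substituting one key for another key controlled by the same finger costs less than any other substitution, which lowers the resulting encoding distance. The coefficient 0.5, the function names, and the choice to apply the weight per substitution are assumptions; the text does not fix a formula, and the physical key distance itself is not modeled here.

```python
# Keys grouped by controlling finger, following the touch-typing layout in the text.
FINGER_KEYS = ["rtfgvb", "edc", "wsx", "qaz", "yuhjnm", "ik", "ol", "p"]
FINGER_OF = {key: finger for finger, keys in enumerate(FINGER_KEYS) for key in keys}


def substitution_cost(a, b, same_finger_cost=0.5):
    """Cost of typing b instead of a: cheaper when one finger controls both keys."""
    if a == b:
        return 0.0
    finger_a, finger_b = FINGER_OF.get(a), FINGER_OF.get(b)
    if finger_a is not None and finger_a == finger_b:
        return same_finger_cost  # assumed coefficient
    return 1.0


def weighted_distance(first_code, second_code, same_finger_cost=0.5):
    """Edit distance with unit insert/delete cost and finger-aware substitution cost."""
    previous = [float(j) for j in range(len(second_code) + 1)]
    for i, ch_a in enumerate(first_code, start=1):
        current = [float(i)]
        for j, ch_b in enumerate(second_code, start=1):
            current.append(min(
                previous[j] + 1.0,
                current[j - 1] + 1.0,
                previous[j - 1] + substitution_cost(ch_a, ch_b, same_finger_cost),
            ))
        previous = current
    return previous[-1]


print(weighted_distance("r", "t"))        # 0.5: both keys belong to the left index finger
print(weighted_distance("imjh", "wmjh"))  # 1.0: "i" and "w" belong to different fingers
print(weighted_distance("whnd", "wntd"))  # 1.5: "h"/"n" share a finger, "n"/"t" do not
```

Lowercase code strings are assumed; keys missing from the finger table fall back to the unit substitution cost.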
Step 203: search with the rewritten search keyword to obtain search result data matching the rewritten search keyword.
After the search keyword has been rewritten, network information can be retrieved and matched by means such as full-text indexing or directory indexing.
In a preferred embodiment of the present invention, the method may further include the following step:
Step 204: generate a search result page from the search result data.
The search engine searches its database. If it finds network information that matches what the user requested, it generally computes the relevance and rank of each web page based on the degree of keyword match, the position and frequency of the keyword, link quality, and so on, and then returns links to this network information to the user in order of relevance.
In a preferred embodiment of the present invention, the method may further include the following step:
Step 205: present, in the search result page, a notice that the search keyword has been corrected.
In a specific implementation, the notice can take any form. For example, the correction notice can be shown below the search engine's input box; to make the notice more prominent, the text before and after correction can also be marked in different colors, and so on. This embodiment places no limitation on this.
In this embodiment of the present invention, error correction is performed on the search keyword, and the search keyword is rewritten using visually similar characters that match it, so as to obtain search result data matching the rewritten search keyword. On the one hand, the rewritten search keyword brings the search results closer to what the user originally intended, which improves the user experience, reduces wasted client and search engine resources, and improves search efficiency. On the other hand, the user no longer has to enter keywords into the search engine a second time to obtain the web page information of interest, and the search engine no longer has to repeat the search, comparison, and filtering of massive amounts of information to obtain information related to the search keyword. This makes operation more convenient for the user, reduces the time the user spends, and further reduces resource consumption on the client and the search engine.
Referring to FIG. 3, a flow chart of the steps of an embodiment of an instant search method according to one embodiment of the present invention is shown. The method may include the following steps:
Step 301: detect the text information currently entered in the search bar;
It should be noted that instant search (Current Event Search Engine, ISE), also called real-time search, is based on emerging technologies such as RSS (Really Simple Syndication), Atom (a pair of related standards), and tags (category labels). It focuses on frequently updated blog and news sites on the Chinese-language web and can provide users with near-real-time search results.
In a specific implementation, the instant search engine can detect the text the user enters in the search bar and return search results as the user types. As the user continues to enter new text, the instant search engine refreshes the search results page accordingly.
Step 302: perform error correction on the currently entered text information;
In one case, natural language processing (NLP) techniques can be used to perform error correction on the search keyword.
In another case, a language model can be used to perform error correction on the currently entered text information.
The instant search engine can collect the user's input text in advance and then train a language model. The trained model may be an N-gram model (a language model commonly used in large-vocabulary continuous speech recognition), a neural-network-based language model, and so on. Learning of the user language model can be performed periodically or while the client is idle.
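As a rough illustration of how an N-gram model could flag suspicious input, the sketch below trains a character bigram model with add-one smoothing and reports positions whose conditional probability falls below a cutoff. The training texts, the cutoff value, and all function names are assumptions; the patent does not specify the model details.

```python
from collections import Counter


def train_bigram(texts):
    """Count single characters and adjacent character pairs in the training texts."""
    unigrams, bigrams = Counter(), Counter()
    for text in texts:
        unigrams.update(text)
        bigrams.update(zip(text, text[1:]))
    return unigrams, bigrams


def bigram_probability(first, second, unigrams, bigrams):
    """Add-one smoothed estimate of P(second | first)."""
    vocab_size = max(len(unigrams), 1)
    return (bigrams[(first, second)] + 1) / (unigrams[first] + vocab_size)


def suspicious_positions(text, unigrams, bigrams, cutoff):
    """Indices of characters that are unlikely given the preceding character."""
    return [
        i + 1
        for i, (first, second) in enumerate(zip(text, text[1:]))
        if bigram_probability(first, second, unigrams, bigrams) < cutoff
    ]


# Hypothetical previously collected user input.
unigrams, bigrams = train_bigram(["测试结果", "测试数据", "侧面"])
print(suspicious_positions("侧试", unigrams, bigrams, cutoff=0.2))
print(suspicious_positions("测试", unigrams, bigrams, cutoff=0.2))
```

Flagged positions would then be candidates for replacement with approximate characters; with this little training data the probabilities are only indicative.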
Of course, the error correction methods above are only examples. When implementing the embodiments of the present invention, other error correction methods may be set according to the actual situation, and those skilled in the art may likewise adopt other error correction methods as actually needed; the embodiments of the present invention do not limit this.
Step 303: provide instant search result data returned for the currently entered text information;
In instant search, each time the user enters new text information, a query request can be sent automatically to the instant search engine and the returned search results displayed, without the user having to trigger the query by pressing the Enter key or a similar key.
Step 304: when the error correction finds an error in the text information, compute approximate characters that match the character data contained in the erroneous text information;
In a specific implementation, the approximate characters may include similar-shaped characters and/or similar-sounding characters.
Similar-sounding characters are characters with the same or similar pronunciation; for example, "案" and "安" are both pronounced "an". Chinese pinyin consists of an initial and a final, so the similarity of the initials and the similarity of the finals of a first character and a second character can be computed separately to obtain the similarity between their pronunciations; when that similarity is greater than a preset similarity threshold, the first character and the second character can be judged to be similar-sounding characters.
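The initial/final comparison can be sketched in Python as follows. This is only one possible reading: the table of confusable pairs, the 0.5 partial score, the equal weighting of initial and final, and the 0.7 threshold are all assumptions chosen for illustration, and tones are ignored.

```python
# Pinyin initials; two-letter initials come first so "zh" is matched before "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

# Hypothetical pairs treated as "close" (assumed, not from the specification).
CONFUSABLE = {frozenset(p) for p in [
    ("z", "zh"), ("c", "ch"), ("s", "sh"), ("n", "l"),
    ("an", "ang"), ("en", "eng"), ("in", "ing"),
]}


def split_pinyin(syllable):
    """Split a toneless pinyin syllable into (initial, final)."""
    for ini in INITIALS:
        if syllable.startswith(ini):
            return ini, syllable[len(ini):]
    return "", syllable  # zero-initial syllable such as "an"


def part_similarity(a, b):
    """1.0 for identical parts, 0.5 for an assumed confusable pair, else 0.0."""
    if a == b:
        return 1.0
    if frozenset((a, b)) in CONFUSABLE:
        return 0.5
    return 0.0


def pinyin_similarity(p1, p2):
    """Average of the initial similarity and the final similarity."""
    i1, f1 = split_pinyin(p1)
    i2, f2 = split_pinyin(p2)
    return 0.5 * part_similarity(i1, i2) + 0.5 * part_similarity(f1, f2)


def is_similar_sounding(p1, p2, threshold=0.7):
    """Judge two syllables similar-sounding when similarity exceeds the threshold."""
    return pinyin_similarity(p1, p2) > threshold
```

With these assumed values, `is_similar_sounding("an", "an")` holds for the "案"/"安" example, and a pair such as "zhang"/"zang" also passes because only the initials differ and they are listed as confusable.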
When the error correction finds an error in the text information, the font database is searched for the approximate character that best matches the context of the erroneous character, and the text information is rewritten with it.
In a preferred embodiment of the present invention, the similar-shaped characters may be obtained as follows:
Sub-step S41: determine a first character and a second character, entered into the search engine, that are to be checked;
In a specific implementation, the first character and the second character may be extracted from a pre-collected corpus and checked for whether they are similar-shaped characters of each other.
In an optional example of the embodiment of the present invention, the first character and the second character may be Chinese characters.
Sub-step S42: obtain a first encoded string of the first character and a second encoded string of the second character according to a preset rule;
In a preferred example of the embodiment of the present invention, the preset rule may include a preset encoding rule, and sub-step S42 may further include the following sub-steps:
Sub-step S421: compute the first encoded string corresponding to the first character according to the preset encoding rule;
Sub-step S422: compute the second encoded string corresponding to the second character according to the same encoding rule;
The preset encoding rule may include the Wubi (five-stroke) encoding rule.
Sub-step S43: compute the encoding distance between the first encoded string and the second encoded string;
In a preferred example of the embodiment of the present invention, the encoding distance may include the edit distance.
Sub-step S44: when the encoding distance is less than a preset distance threshold, judge that the first character and the second character are similar-shaped characters of each other.
Sub-step S45: establish a similar-shaped-character mapping between the first character and the second character in the search engine.
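Sub-steps S41 to S45 can be sketched in Python as follows, using the edit distance named in the preferred example as the encoding distance. The three Wubi codes in the table are illustrative assumptions that should be checked against a real Wubi code table, and the threshold of 2 and the function names are likewise chosen only for this example.

```python
from collections import defaultdict
from itertools import combinations

# Illustrative codes only (assumed, not taken from the specification);
# a real system would load a complete Wubi code table.
WUBI_CODES = {"已": "nnnn", "己": "nngn", "我": "trnt"}


def edit_distance(a, b):
    """Levenshtein distance between two encoded strings (sub-step S43)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def build_similar_shape_map(codes, threshold=2):
    """Sub-steps S44-S45: map each character to those whose code distance is below the threshold."""
    mapping = defaultdict(set)
    for (c1, s1), (c2, s2) in combinations(codes.items(), 2):
        if edit_distance(s1, s2) < threshold:
            mapping[c1].add(c2)
            mapping[c2].add(c1)
    return dict(mapping)


similar = build_similar_shape_map(WUBI_CODES)
```

Under the assumed codes, "已" and "己" differ in one code letter and are mapped to each other, while "我" is three edits away from both and gets no entry. Comparing every pair is quadratic in the number of characters, which may be acceptable for a one-off offline build of the mapping.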
In another preferred embodiment of the present invention, the similar-shaped characters may be obtained as follows:
Sub-step S51: determine a first character and a second character, entered into the search engine, that are to be checked for whether they are similar-shaped characters;
Sub-step S52: obtain a first encoded string of the first character and a second encoded string of the second character according to a preset rule;
Sub-step S53: compute the encoding distance between the first encoded string and the second encoded string;
Sub-step S54: look up the first input keys corresponding to the first encoded string;
Sub-step S55: look up the second input keys corresponding to the second encoded string;
Sub-step S56: compute the key distance between the first input keys and the second input keys;
Sub-step S57: configure a corresponding weight for the encoding distance according to the key distance;
Sub-step S58: when the encoding distance configured with the weight is less than a preset distance threshold, judge that the first character and the second character are similar-shaped characters of each other;
Sub-step S59: establish a similar-shaped-character mapping between the first character and the second character in the search engine.
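The weighted variant in sub-steps S53 to S58 might look like the following Python sketch. The specification states only that the weight is configured from the key distance (and, in a preferred embodiment, that the two may be inversely proportional); everything concrete here is an assumption for illustration: the QWERTY key grid, the Manhattan key distance, the position-by-position pairing of code letters, the formula weight = 1 / (1 + mean key distance), and the threshold of 1.0.

```python
# Assumed key grid: row and column index of each letter on a QWERTY keyboard.
ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")
KEY_POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}


def edit_distance(a, b):
    """Levenshtein distance between two encoded strings (sub-step S53)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def key_distance(k1, k2):
    """Manhattan distance between two keys on the assumed grid (sub-step S56)."""
    (r1, c1), (r2, c2) = KEY_POS[k1], KEY_POS[k2]
    return abs(r1 - r2) + abs(c1 - c2)


def mean_key_distance(code1, code2):
    """Average key distance over letters paired position by position."""
    pairs = list(zip(code1, code2))
    if not pairs:
        return 0.0
    return sum(key_distance(a, b) for a, b in pairs) / len(pairs)


def weighted_distance(code1, code2):
    """Sub-step S57: scale the edit distance by a weight that shrinks as key distance grows."""
    weight = 1.0 / (1.0 + mean_key_distance(code1, code2))
    return weight * edit_distance(code1, code2)


def is_similar_shape(code1, code2, threshold=1.0):
    """Sub-step S58: compare the weighted encoding distance with the threshold."""
    return weighted_distance(code1, code2) < threshold
```

For the code pair "nnnn"/"nngn", the only differing letters are "n" and "g", which sit two grid steps apart, so the mean key distance is 0.5, the weight is 2/3, and the weighted distance is 2/3.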
Step 305: insert into the instant search result data a prompt that recommends approximate characters for correcting the erroneous text information;
In a specific implementation, the prompt may take any form. For example, a message suggesting the recommended approximate characters as a correction may be shown below the input box; to make the prompt more noticeable, the text before correction and the recommended approximate characters may also be marked in different colors, and so on. The embodiments of the present invention do not limit this.
Step 306: when a trigger indication from the user for the prompt is received, provide instant search result data obtained by searching with the approximate characters corresponding to the trigger indication.
A trigger indication may be an instruction issued by the user to replace the erroneous text information with a particular approximate character. For example, when the user clicks on the prompt, a trigger indication has been received. As another example, when the user selects an approximate character with a key such as Tab and then presses the Enter key, a trigger indication has likewise been received.
When a trigger indication from the user for the prompt is received, instant search result data can be provided again, this time returned for the text information in which the error has been replaced according to the trigger indication.
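The overall flow of steps 301 to 306 can be outlined with the following Python sketch. The `search` function is a stand-in for the real search back end, and the similar-character map, the vocabulary used to spot errors, and the handler names `on_input` and `on_hint_triggered` are all hypothetical.

```python
def search(query):
    """Stand-in for the real instant search back end (hypothetical)."""
    return [f"result for {query}"]


SIMILAR = {"己": ["已"]}  # hypothetical approximate-character map
KNOWN_TERMS = {"已经"}    # hypothetical vocabulary used to spot errors


def suggest(text):
    """Steps 302 and 304: return a corrected text if one character swap yields a known term."""
    if text in KNOWN_TERMS:
        return None  # no error found
    for i, ch in enumerate(text):
        for alt in SIMILAR.get(ch, []):
            candidate = text[:i] + alt + text[i + 1:]
            if candidate in KNOWN_TERMS:
                return candidate
    return None


def on_input(text):
    """Steps 301, 303, 305: search on the current text and attach a hint if an error is found."""
    return {"results": search(text), "hint": suggest(text)}


def on_hint_triggered(hint):
    """Step 306: the user accepted the hint, so search again with the suggested text."""
    return {"results": search(hint), "hint": None}


first = on_input("己经")                    # results for the text as typed, plus a hint
second = on_hint_triggered(first["hint"])  # results for the corrected text
```

In this outline the results for the text as typed are still returned alongside the hint, and the corrected search runs only after the user triggers it, which matches the order of steps 303, 305, and 306.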
The embodiments of the present invention perform error correction on text information in the instant search engine and rewrite the search keywords with approximate characters that match the text information, so as to obtain search result data matching the rewritten text information. On the one hand, the rewritten search keywords bring the search results closer to what the user originally intended, which improves the user experience, reduces wasted client and search engine resources, and improves search efficiency. On the other hand, the user no longer has to enter keywords again to find the web page information of interest, which spares the search engine from searching, comparing, and filtering massive amounts of information a second time to obtain information related to the search keywords; this makes operation more convenient for the user, saves the user's time, and further reduces the resource consumption of the client and the search engine.
For simplicity of description, the method embodiments are all expressed as a series of combined actions, but those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
Referring to FIG. 4, a structural block diagram of an embodiment of an apparatus for determining similar-shaped characters in a search engine according to one embodiment of the present invention is shown; the apparatus may include the following modules:
a character determining module 401, adapted to determine a first character and a second character, entered into the search engine, that are to be checked;
an encoding obtaining module 402, adapted to obtain a first encoded string of the first character and a second encoded string of the second character according to a preset rule;
an encoding distance calculation module 403, adapted to compute the encoding distance between the first encoded string and the second encoded string;
a similar-shaped-character judging module 404, adapted to judge that the first character and the second character are similar-shaped characters of each other when the encoding distance is less than a preset distance threshold;
a mapping determining module 405, adapted to establish a similar-shaped-character mapping between the first character and the second character in the search engine.
In a preferred embodiment of the present invention, the preset rule may include a preset encoding rule, and the encoding obtaining module may further be adapted to:
compute the first encoded string corresponding to the first character according to the preset encoding rule;
compute the second encoded string corresponding to the second character according to the same encoding rule;
wherein the preset encoding rule includes the Wubi (five-stroke) encoding rule.
In a preferred embodiment of the present invention, the apparatus may further include the following module:
an output module, adapted to output the first character and the second character that are similar-shaped characters of each other, together with the similar-shaped-character mapping, to a specified font database.
Referring to FIG. 5, a structural block diagram of an embodiment of an apparatus for providing keyword error correction in search according to one embodiment of the present invention is shown; the apparatus may include the following units:
a receiving unit 501, adapted to receive a search request, the search request including a search keyword;
a rewriting unit 502, adapted to rewrite the search keyword with a similar-shaped character matching the search keyword when error correction on the search keyword finds an error;
a search unit 503, adapted to search with the rewritten search keyword and obtain search result data matching the rewritten search keyword.
In a preferred embodiment of the present invention, the similar-shaped characters may be obtained by invoking the following modules:
a character determining module, adapted to determine a first character and a second character, entered into the search engine, that are to be checked;
an encoding obtaining module, adapted to obtain a first encoded string of the first character and a second encoded string of the second character according to a preset rule;
an encoding distance calculation module, adapted to compute the encoding distance between the first encoded string and the second encoded string;
a similar-shaped-character judging module, adapted to judge that the first character and the second character are similar-shaped characters of each other when the encoding distance is less than a preset distance threshold;
a mapping determining module, adapted to establish a similar-shaped-character mapping between the first character and the second character in the search engine.
In a preferred embodiment of the present invention, the preset rule may include a preset encoding rule, and the encoding obtaining module is further adapted to:
compute the first encoded string corresponding to the first character according to the preset encoding rule;
compute the second encoded string corresponding to the second character according to the same encoding rule;
wherein the preset encoding rule includes the Wubi (five-stroke) encoding rule.
In a preferred embodiment of the present invention, the similar-shaped characters may also be obtained by invoking the following modules:
a first lookup module, adapted to look up the first input keys corresponding to the first encoded string;
a second lookup module, adapted to look up the second input keys corresponding to the second encoded string;
a key distance calculation module, adapted to compute the key distance between the first input keys and the second input keys;
a weight configuration module, adapted to configure a corresponding weight for the encoding distance according to the key distance;
the similar-shaped-character judging module may further be adapted to:
judge that the first character and the second character are similar-shaped characters of each other when the encoding distance configured with the weight is less than a preset distance threshold.
In a preferred embodiment of the present invention, the key distance may be inversely proportional to the weight.
In a preferred embodiment of the present invention, the apparatus may further include the following unit:
a generating unit, adapted to generate a search result page from the search result data.
In a preferred embodiment of the present invention, the apparatus may further include the following unit:
a prompting unit, adapted to show, in the search result page, information indicating that the search keyword has been corrected.
Referring to FIG. 6, a structural block diagram of an embodiment of an instant search system according to one embodiment of the present invention is shown; the system may include the following units:
a text information detecting unit 601, adapted to detect the text information currently entered in the search bar;
an error correction processing unit 602, adapted to perform error correction on the currently entered text information;
a first result providing unit 603, adapted to provide instant search result data returned for the currently entered text information;
an approximate character calculation unit 604, adapted to compute, when the error correction finds an error in the text information, approximate characters that match the character data contained in the erroneous text information;
an error correction prompting unit 605, adapted to insert into the instant search result data a prompt that recommends approximate characters for correcting the erroneous text information;
a second result providing unit 606, adapted to provide, when a trigger indication from the user for the prompt is received, instant search result data obtained by searching with the approximate characters corresponding to the trigger indication.
In a preferred embodiment of the present invention, the approximate characters may include similar-shaped characters and/or similar-sounding characters.
In a preferred embodiment of the present invention, the similar-shaped characters may be obtained by invoking the following modules:
a character determining module, adapted to determine a first character and a second character, entered into the search engine, that are to be checked;
an encoding obtaining module, adapted to obtain a first encoded string of the first character and a second encoded string of the second character according to a preset rule;
an encoding distance calculation module, adapted to compute the encoding distance between the first encoded string and the second encoded string;
a similar-shaped-character judging module, adapted to judge that the first character and the second character are similar-shaped characters of each other when the encoding distance is less than a preset distance threshold;
a mapping determining module, adapted to establish a similar-shaped-character mapping between the first character and the second character in the search engine.
In a preferred embodiment of the present invention, the preset rule may include a preset encoding rule, and the encoding obtaining module may further be adapted to:
compute the first encoded string corresponding to the first character according to the preset encoding rule;
compute the second encoded string corresponding to the second character according to the same encoding rule;
wherein the preset encoding rule includes the Wubi (five-stroke) encoding rule.
In a preferred embodiment of the present invention, the similar-shaped characters may also be obtained by invoking the following modules:
a first lookup module, adapted to look up the first input keys corresponding to the first encoded string;
a second lookup module, adapted to look up the second input keys corresponding to the second encoded string;
a key distance calculation module, adapted to compute the key distance between the first input keys and the second input keys;
a weight configuration module, adapted to configure a corresponding weight for the encoding distance according to the key distance;
the similar-shaped-character judging module may further be adapted to:
judge that the first character and the second character are similar-shaped characters of each other when the encoding distance configured with the weight is less than a preset distance threshold.
In a preferred embodiment of the present invention, the key distance may be inversely proportional to the weight.
For the apparatus embodiments, since they are basically similar to the method embodiments, the description is relatively brief; for related details, refer to the corresponding parts of the description of the method embodiments.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art should understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for determining similar-shaped characters in a search engine, and/or for providing keyword error correction in search, and/or for instant search, according to embodiments of the present invention. The present invention may also be implemented as device or apparatus programs (for example, computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the present invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
For example, FIG. 7 shows a computing device, such as an application server, that can implement determining similar-shaped characters in a search engine, providing keyword error correction in search, and instant search according to the present invention. The computing device conventionally includes a processor 710 and a computer program product or computer-readable medium in the form of a memory 720. The memory 720 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk, or ROM. The memory 720 has a storage space 730 for program code 731 for performing any of the method steps described above. For example, the storage space 730 for program code may include individual program codes 731, each implementing one of the steps of the methods above. The program code can be read from or written to one or more computer program products. These computer program products include program code carriers such as hard disks, compact discs (CDs), memory cards, or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. 8. The storage unit may have storage segments, storage space, and the like arranged similarly to the memory 720 in the computing device of FIG. 7. The program code may, for example, be compressed in a suitable form. Typically, the storage unit includes computer-readable code 731', that is, code that can be read by a processor such as 710; when run by a computing device, this code causes the computing device to perform each step of the methods described above.
References herein to "one embodiment", "an embodiment", or "one or more embodiments" mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Note also that instances of the phrase "in one embodiment" here do not necessarily all refer to the same embodiment.
Numerous specific details are set forth in the description provided here. It will be understood, however, that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.
In addition, it should be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or circumscribe the subject matter of the invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With regard to the scope of the present invention, the disclosure made herein is illustrative rather than restrictive, and the scope of the invention is defined by the appended claims.

Claims (20)

  1. A method for determining similar-shaped characters in a search engine, comprising:
    determining a first character and a second character, entered into the search engine, that are to be checked;
    obtaining a first encoded string of the first character and a second encoded string of the second character according to a preset rule;
    computing the encoding distance between the first encoded string and the second encoded string;
    when the encoding distance is less than a preset distance threshold, judging that the first character and the second character are similar-shaped characters of each other;
    establishing a similar-shaped-character mapping between the first character and the second character in the search engine.
  2. The method according to claim 1, wherein the preset rule includes a preset encoding rule, and the step of obtaining the first encoded string of the first character and the second encoded string of the second character comprises:
    computing the first encoded string corresponding to the first character according to the preset encoding rule;
    computing the second encoded string corresponding to the second character according to the same encoding rule;
    wherein the preset encoding rule includes the Wubi (five-stroke) encoding rule.
  3. The method according to claim 1 or 2, further comprising:
    outputting the first character and the second character that are similar-shaped characters of each other, together with the similar-shaped-character mapping, to a specified font database.
  4. 一种提供搜索中关键词纠错的方法,包括:A method for providing keyword correction in a search, comprising:
    接收搜索请求;所述搜索请求中包括搜索关键词;Receiving a search request; the search request includes a search keyword;
    当对所述搜索关键词进行纠错处理发现错误时,采用与所述搜索关键词匹配的形近字对所述搜索关键词进行改写;When an error is found in the error correction processing on the search keyword, the search keyword is rewritten by using a near word matching the search keyword;
    以改写后的搜索关键词进行搜索,获得与所述改写后的搜索关键词相匹配的搜索结果数据。The search is performed by the rewritten search keyword, and search result data matching the rewritten search keyword is obtained.
  5. 如权利要求4所述的方法,其特征在于,所述形近字通过以下方式获得:The method of claim 4 wherein said near word is obtained by:
    确定待输入搜索引擎中的校验是否为形近字的第一文字和第二文字;Determining whether the verification to be input into the search engine is a first character and a second text of a near-word;
    按照预设规则获取所述第一文字的第一编码字符串以及所述第二文字的第二编码字符串;Acquiring the first encoded character string of the first character and the second encoded character string of the second character according to a preset rule;
    计算所述第一编码字符串和所述第二编码字符串之间的编码距离;Calculating an encoding distance between the first encoded string and the second encoded string;
    当所述编码距离小于预设距离阈值时,判定所述第一文字与所述第二文字互为形近字;Determining, when the coding distance is less than a preset distance threshold, determining that the first character and the second character are in close proximity to each other;
    在搜索引擎中建立第一文字与第二文字之间的形近字映射关系。A near-word mapping relationship between the first text and the second text is established in the search engine.
  6. The method of claim 5, wherein the preset rule includes a preset encoding rule, and the step of obtaining, according to the preset rule, the first encoded string of the first character and the second encoded string of the second character comprises:
    computing, according to the preset encoding rule, the first encoded string corresponding to the first character;
    computing, according to the encoding rule, the second encoded string corresponding to the second character;
    wherein the preset encoding rule includes the Wubi (五笔) encoding rule.
  7. The method of claim 5 or 6, wherein the similar-shaped character corresponding to the first character in the font database is further obtained by:
    looking up the first input keys corresponding to the first encoded string;
    looking up the second input keys corresponding to the second encoded string;
    calculating the key distance between the first input keys and the second input keys;
    assigning a corresponding weight to the encoding distance according to the key distance;
    wherein the step of determining, when the encoding distance is less than the preset distance threshold, that the first character and the second character are similar-shaped characters of each other is:
    when the encoding distance with the weight applied is less than the preset distance threshold, determining that the first character and the second character are similar-shaped characters of each other.
  8. The method of claim 7, wherein the key distance is inversely proportional to the weight.
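One literal reading of claims 7 and 8 is sketched below. Several points are assumptions not stated in the claims: keys are placed on an unstaggered QWERTY grid, the key distance is the sum of Euclidean distances between keys at matching code positions, the weight is `k / key_distance`, and the weight is applied by multiplication. All names are hypothetical.

```python
import math

# Assumed key layout: QWERTY rows on a simple (row, column) grid, no stagger.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {key: (r, c) for r, row in enumerate(ROWS) for c, key in enumerate(row)}


def key_distance(code_a, code_b):
    """Sum of Euclidean distances between the keys at matching code positions."""
    return sum(math.dist(KEY_POS[a], KEY_POS[b]) for a, b in zip(code_a, code_b))


def weighted_encoding_distance(encoding_distance, code_a, code_b, k=1.0):
    """Apply a weight that is inversely proportional to the key distance.

    The key distance is clamped to at least 1.0 (the smallest non-zero distance
    on this grid) so that identical key sequences do not divide by zero.
    """
    weight = k / max(key_distance(code_a, code_b), 1.0)
    return encoding_distance * weight  # multiplication is an assumption


def is_similar(encoding_distance, code_a, code_b, threshold):
    """Claim 7 test: compare the weighted encoding distance with the threshold."""
    return weighted_encoding_distance(encoding_distance, code_a, code_b) < threshold
```

Under this reading, a pair of codes typed on distant keys receives a smaller weight, and therefore a smaller weighted distance, than a pair typed on adjacent keys.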
  9. The method of claim 4, further comprising:
    generating a search result page from the search result data.
  10. An instant search method, comprising:
    detecting the text currently entered in a search bar, performing error-correction processing on the currently entered text, and providing instant search result data fed back based on the currently entered text;
    when error-correction processing of the text finds an error, computing approximate characters that match the character data contained in the erroneous text;
    inserting, into the instant search result data, prompt information that recommends the approximate characters for correcting the erroneous text;
    upon receiving a user's trigger indication for the prompt information, providing instant search result data obtained by searching with the approximate characters corresponding to the trigger indication.
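The interaction described in claim 10 could look roughly like this. The sketch is illustrative only: the index, the approximate-character map, the error check (a query with no index entry) and every function name are hypothetical.

```python
# Hypothetical data: a tiny search index and an approximate-character map.
INDEX = {"已经": ["result A", "result B"]}
APPROX = {"己": ["已"]}


def has_error(text):
    """Assumed error check: the text has no entry in the index."""
    return text not in INDEX


def hints_for(text):
    """Candidate corrections built by swapping in approximate characters."""
    hints = []
    for i, ch in enumerate(text):
        for alt in APPROX.get(ch, []):
            candidate = text[:i] + alt + text[i + 1:]
            if candidate in INDEX and candidate not in hints:
                hints.append(candidate)
    return hints


def on_input(text):
    """Called for the text currently in the search bar."""
    response = {"query": text, "results": INDEX.get(text, []), "hints": []}
    if has_error(text):
        # Insert correction hints alongside the instant results.
        response["hints"] = hints_for(text)
    return response


def on_hint_selected(hint):
    """Called when the user triggers one of the hints."""
    return {"query": hint, "results": INDEX.get(hint, []), "hints": []}
```

In this sketch the hint is only a suggestion: the original query's results are still returned, and the search is re-run with the corrected text only after the user selects the hint.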
  11. The method of claim 10, wherein the approximate characters include similar-shaped characters (形近字) and/or similar-sounding characters (音近字).
  12. The method of claim 11, wherein the similar-shaped characters are obtained by:
    determining a first character and a second character, input into the search engine, that are to be checked for being similar-shaped characters;
    obtaining, according to a preset rule, a first encoded string of the first character and a second encoded string of the second character;
    calculating an encoding distance between the first encoded string and the second encoded string;
    when the encoding distance is less than a preset distance threshold, determining that the first character and the second character are similar-shaped characters of each other;
    establishing, in the search engine, a similar-shaped-character mapping between the first character and the second character.
  13. An apparatus for determining similar-shaped characters in a search engine, comprising:
    a character determination module, adapted to determine a first character and a second character, input into the search engine, that are to be checked;
    an encoding acquisition module, adapted to obtain, according to a preset rule, a first encoded string of the first character and a second encoded string of the second character;
    an encoding distance calculation module, adapted to calculate an encoding distance between the first encoded string and the second encoded string;
    a similar-shaped character determination module, adapted to determine, when the encoding distance is less than a preset distance threshold, that the first character and the second character are similar-shaped characters of each other;
    a mapping establishment module, adapted to establish, in the search engine, a similar-shaped-character mapping between the first character and the second character.
  14. The apparatus of claim 13, wherein the preset rule includes a preset encoding rule, and the encoding acquisition module is further adapted to:
    compute, according to the preset encoding rule, the first encoded string corresponding to the first character;
    compute, according to the encoding rule, the second encoded string corresponding to the second character;
    wherein the preset encoding rule includes the Wubi (五笔) encoding rule.
  15. The apparatus of claim 13 or 14, further comprising:
    an output module, adapted to output the first character and the second character that are similar-shaped characters of each other, together with the similar-shaped-character mapping, to a specified font database.
  16. An apparatus for providing keyword error correction in a search, comprising:
    a receiving unit, adapted to receive a search request, the search request including a search keyword;
    a rewriting unit, adapted to rewrite the search keyword, when error-correction processing of the search keyword finds an error, using a similar-shaped character that matches the search keyword;
    a search unit, adapted to search with the rewritten search keyword to obtain search result data that matches the rewritten search keyword.
  17. The apparatus of claim 16, wherein the similar-shaped character is obtained by invoking the following modules:
    a character determination module, adapted to determine a first character and a second character, input into the search engine, that are to be checked;
    an encoding acquisition module, adapted to obtain, according to a preset rule, a first encoded string of the first character and a second encoded string of the second character;
    an encoding distance calculation module, adapted to calculate an encoding distance between the first encoded string and the second encoded string;
    a similar-shaped character determination module, adapted to determine, when the encoding distance is less than a preset distance threshold, that the first character and the second character are similar-shaped characters of each other;
    a mapping establishment module, adapted to establish, in the search engine, a similar-shaped-character mapping between the first character and the second character.
  18. An instant search system, comprising:
    a text detection unit, adapted to detect the text currently entered in a search bar;
    an error-correction processing unit, adapted to perform error-correction processing on the currently entered text;
    a first result providing unit, adapted to provide instant search result data fed back based on the currently entered text;
    an approximate character calculation unit, adapted to compute, when error-correction processing of the text finds an error, approximate characters that match the character data contained in the erroneous text;
    an error-correction prompting unit, adapted to insert, into the instant search result data, prompt information that recommends the approximate characters for correcting the erroneous text;
    a second result providing unit, adapted to provide, upon receiving a user's trigger indication for the prompt information, instant search result data obtained by searching with the approximate characters corresponding to the trigger indication.
  19. A computer program comprising computer-readable code which, when run on a computing device, causes the computing device to perform the method for determining similar-shaped characters in a search engine, the method for providing keyword error correction in a search, and the instant search method according to any one of claims 1-12.
  20. A computer-readable medium storing the computer program of claim 19.
PCT/CN2014/094933 2014-03-19 2014-12-25 Method and apparatus for determining similar characters in search engine WO2015139497A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201410103601.5 2014-03-19
CN201410104483.X 2014-03-19
CN201410104483.XA CN103927330A (en) 2014-03-19 2014-03-19 Method and device for determining characters with similar forms in search engine
CN201410103601.5A CN103927329B (en) 2014-03-19 2014-03-19 A kind of instant search method and system

Publications (1)

Publication Number Publication Date
WO2015139497A1 (en) 2015-09-24

Family

ID=54143746

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/094933 WO2015139497A1 (en) 2014-03-19 2014-12-25 Method and apparatus for determining similar characters in search engine

Country Status (1)

Country Link
WO (1) WO2015139497A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101241514A (en) * 2008-03-21 2008-08-13 北京搜狗科技发展有限公司 Method for creating error-correcting database, automatic error correcting method and system
CN101710262A (en) * 2009-12-11 2010-05-19 北京搜狗科技发展有限公司 Error correction method and error correction device of characters
CN102135814A (en) * 2011-03-30 2011-07-27 北京搜狗科技发展有限公司 Word input method and system
CN103927330A (en) * 2014-03-19 2014-07-16 北京奇虎科技有限公司 Method and device for determining characters with similar forms in search engine
CN103927329A (en) * 2014-03-19 2014-07-16 北京奇虎科技有限公司 Instant search method and system

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108629046B (en) * 2018-05-14 2023-08-18 平安科技(深圳)有限公司 Field matching method and terminal equipment
CN108629046A (en) * 2018-05-14 2018-10-09 平安科技(深圳)有限公司 A kind of fields match method and terminal device
CN109344387A (en) * 2018-08-01 2019-02-15 北京奇艺世纪科技有限公司 The generation method of nearly word form dictionary, device and nearly word form error correction method, device
CN109344387B (en) * 2018-08-01 2023-12-19 北京奇艺世纪科技有限公司 Method and device for generating shape near word dictionary and method and device for correcting shape near word error
CN110609859A (en) * 2019-09-19 2019-12-24 惠州市中心人民医院 Intelligent accurate retrieval method based on phrase library
CN110688457A (en) * 2019-09-25 2020-01-14 重庆忽米网络科技有限公司 Steam-massage industry text information input method based on identification analysis
CN111222590A (en) * 2019-12-31 2020-06-02 咪咕文化科技有限公司 Font-near word determining method, electronic device and computer-readable storage medium
CN111222590B (en) * 2019-12-31 2024-04-12 咪咕文化科技有限公司 Shape-near-word determining method, electronic device, and computer-readable storage medium
CN111368918A (en) * 2020-03-04 2020-07-03 拉扎斯网络科技(上海)有限公司 Text error correction method and device, electronic equipment and storage medium
CN111368918B (en) * 2020-03-04 2024-01-05 拉扎斯网络科技(上海)有限公司 Text error correction method and device, electronic equipment and storage medium
CN112765962A (en) * 2021-01-15 2021-05-07 上海微盟企业发展有限公司 Text error correction method, device and medium
CN114398463A (en) * 2021-12-30 2022-04-26 南京硅基智能科技有限公司 Voice tracking method and device, storage medium and electronic equipment
CN114398463B (en) * 2021-12-30 2023-08-11 南京硅基智能科技有限公司 Voice tracking method and device, storage medium and electronic equipment
CN117831573A (en) * 2024-03-06 2024-04-05 青岛理工大学 Multi-mode-based language barrier crowd speech recording analysis method and system
CN117831573B (en) * 2024-03-06 2024-05-14 青岛理工大学 Multi-mode-based language barrier crowd speech recording analysis method and system

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 14886337; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 EP: PCT application non-entry in European phase. Ref document number: 14886337; Country of ref document: EP; Kind code of ref document: A1