WO2017097075A1 - Fuzzy keyword matching method and apparatus - Google Patents

Info

Publication number
WO2017097075A1
WO2017097075A1 PCT/CN2016/104693 CN2016104693W WO2017097075A1 WO 2017097075 A1 WO2017097075 A1 WO 2017097075A1 CN 2016104693 W CN2016104693 W CN 2016104693W WO 2017097075 A1 WO2017097075 A1 WO 2017097075A1
Authority
WO
WIPO (PCT)
Prior art keywords
character
keyword
matching
text
matched
Prior art date
Application number
PCT/CN2016/104693
Other languages
French (fr)
Chinese (zh)
Inventor
李剑
毛宏
Original Assignee
北京搜狗科技发展有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京搜狗科技发展有限公司
Publication of WO2017097075A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Definitions

  • the present invention relates to the field of Internet technologies, and in particular, to a method and apparatus for keyword fuzzy matching.
  • In existing keyword matching, a regular expression is usually used: a single pattern string describes and matches a series of strings that conform to a given rule. The regular expression is generally compiled by the regular-expression matching engine into a non-deterministic finite automaton (NFA) or a deterministic finite automaton (DFA).
  • NFA non-deterministic finite automaton
  • DFA deterministic finite automaton
  • The invention provides a method for keyword fuzzy matching which can, to a certain extent, remedy the low efficiency of existing keyword matching.
  • the invention provides the following solutions:
  • A method for fuzzy matching of a keyword, comprising: for each character in the text to be matched, obtaining, according to the keyword set, the keyword to which the character belongs and the index bit of the character in that keyword; determining, according to the index bit of the character in the keyword, whether the character is the first character of the keyword; if the character is the first character of the keyword, recording the keyword to which the character belongs in the matching information set, and marking in the record that the first character of the keyword is present in the text to be matched; if the character is not the first character of the keyword, and a record of the keyword to which the character belongs exists in the matching information set, obtaining that record and marking in it that the character is present in the text to be matched; and, when every character of a keyword is marked as present in the text to be matched, determining that the text to be matched hits the keyword.
  • An apparatus for fuzzy matching of a keyword, comprising: an obtaining module, configured to obtain, for each character in the text to be matched and according to the keyword set, the keyword to which the character belongs and the index bit of the character in that keyword; a judging module, configured to determine, according to the index bit of the character in the keyword, whether the character is the first character of the keyword; a first marking module, configured to, when the result of the judging module is yes, record the keyword to which the character belongs in the matching information set, and mark in the record that the first character of the keyword is present in the text to be matched; a second marking module, configured to, when the result of the judging module is no and a record of the keyword to which the character belongs exists in the matching information set, obtain that record and mark in it that the character is present in the text to be matched; and a determining module, configured to determine that the text to be matched hits a keyword when every character of the keyword is marked as present in the text to be matched.
  • An apparatus for keyword fuzzy matching, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for: for each character in the text to be matched, obtaining, according to the keyword set, the keyword to which the character belongs and the index bit of the character in that keyword; determining, according to the index bit of the character in the keyword, whether the character is the first character of the keyword; if the character is the first character of the keyword, recording the keyword to which the character belongs in the matching information set, and marking in the record that the first character of the keyword is present in the text to be matched; if the character is not the first character of the keyword, and a record of the keyword to which the character belongs exists in the matching information set, obtaining that record and marking in it that the character is present in the text to be matched; and, when every character of a keyword is marked as present in the text to be matched, determining that the text to be matched hits the keyword.
  • a program comprising readable code that, when executed on a server, causes the server to perform a method of keyword fuzzy matching according to any of the embodiments of the present invention.
  • a readable medium in which the program described in the embodiments of the present invention is stored.
  • the present invention discloses the following technical effects:
  • For each character in the text to be matched, the keyword to which the character belongs and the index bit of the character in that keyword are obtained, and it is determined whether the character is the first character of the keyword. If it is the first character, the keyword to which the character belongs is recorded, and the first character of the keyword is marked as present in the text to be matched; if it is not the first character, the keyword to which the character belongs is looked up among the recorded keywords, and the character is marked there as present in the text to be matched. When every character of a keyword has been marked as present in the text to be matched, the text to be matched hits the keyword.
  • FIG. 1 is a flowchart of a method for keyword fuzzy matching according to an embodiment of the present invention
  • FIG. 2 is a flowchart of a method for performing keyword fuzzy matching for each character in a text to be matched according to an embodiment of the present invention
  • FIG. 3 is a block diagram of a multi-keyword fuzzy matching device according to an embodiment of the present invention.
  • FIG. 4 is a block diagram of an apparatus for keyword fuzzy matching, according to an exemplary embodiment
  • FIG. 5 is a schematic structural diagram of a server in an embodiment of the present invention.
  • FIG. 6 shows a block diagram of a server for performing a multi-keyword fuzzy matching method according to the present invention
  • Fig. 7 shows a storage unit for holding or carrying program code implementing the multi-keyword fuzzy matching method according to the present invention.
  • the invention provides a method for keyword fuzzy matching, as shown in FIG. 1 , comprising:
  • Step S101: for each character in the text to be matched, acquiring, according to the keyword set, the keyword to which the character belongs and the index bit of the character in that keyword;
  • the text to be matched is scanned, and each time a character is scanned, the keyword to which the character belongs is obtained from the keyword set, and an index bit of the character in the keyword is obtained.
  • One character in the text to be matched may correspond to one or more keywords, or may not have corresponding keywords.
  • The method further includes: constructing, for each character of each keyword in the keyword set, a matching rule corresponding to that character, where the matching rule includes the character, the keyword to which the character belongs, the number of characters included in that keyword, and the index bit of the character in the keyword; collecting the matching rules of all characters in a keyword to form the matching rule set corresponding to the keyword; and constructing, according to the matching rule set, an inverted index table from characters to matching rules.
  • the inverted index table includes: a character, and all matching rules corresponding to the character; based on the foregoing, the matching rule corresponding to the character may refer to a matching rule including the character.
  • The keyword to which the character belongs and the index bit of the character in that keyword are obtained from the inverted index table. Specifically, each character in the text to be matched is looked up in the inverted index table. When the table contains the character, all matching rules corresponding to the character are obtained, and from each matching rule the keyword to which the character belongs and the index bit of the character in that keyword are read.
  • Because an inverted index table is created, all keywords corresponding to a character, together with the index bit of the character in each of those keywords, can be obtained quickly by a table lookup. It is no longer necessary to match against each keyword in turn, which makes the matching process simpler, faster, and more efficient.
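  • The matching rules and the inverted index table described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the names `MatchRule` and `build_inverted_index`, the wildcard symbols `?` and `*`, and the Latin-letter stand-in keywords are assumptions introduced here.

```python
from collections import defaultdict
from typing import NamedTuple

WILDCARDS = {"?", "*"}  # assumed wildcard symbols


class MatchRule(NamedTuple):
    character: str  # the non-wildcard character
    keyword: str    # the keyword the character belongs to
    size: int       # number of non-wildcard characters in the keyword
    index: int      # position of the character among the non-wildcards


def build_inverted_index(keywords):
    """Map each non-wildcard character to every matching rule that contains it."""
    table = defaultdict(list)
    for kw in keywords:
        chars = [c for c in kw if c not in WILDCARDS]
        for i, c in enumerate(chars):
            table[c].append(MatchRule(c, kw, len(chars), i))
    return dict(table)


# Hypothetical stand-in keywords: one letter per character of a keyword.
index = build_inverted_index(["g??o*iv", "f??ga"])
for rule in index["g"]:
    print(rule)
```

  • A character shared by two keywords, such as `g` here, maps to two rules, so a single lookup yields every keyword it may belong to.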
  • the keywords may include wildcards and non-wildcards.
  • non-wildcards are collectively referred to as characters.
  • The keywords may contain one or several wildcards. These wildcards may be consecutive or separated, and the length matched by a wildcard can be any number of characters.
  • The keyword set may be added to, modified, and deleted from. Correspondingly, when the keyword set is changed, the content of the inverted index table is updated according to the specific operation.
  • Step S102: determining, according to the index bit of the character in the keyword, whether the character is the first character of the keyword; if it is the first character, step S103 is performed; if it is not the first character, step S104 is performed;
  • the characters in the keyword are divided into a first character and a non-first character, wherein the first character refers to the first non-wildcard in the keyword.
  • Non-first characters refer to non-wildcards in the keyword other than the first character. When there is only one non-wildcard in the keyword, the character is the first character.
  • Step S103: recording the keyword to which the character belongs in the matching information set, and marking in the record that the first character of the keyword is present in the text to be matched;
  • In step S103, if the character is the first character of the keyword, matching process information corresponding to the keyword to which the character belongs is newly created and saved in the matching information set, and the index bit of the character in the text to be matched is recorded in the matching process information.
  • the matching process information is in one-to-one correspondence with the keywords to which the characters belong.
  • Recording the index bit of the character in the text to be matched into the matching process information marks that character of the keyword as present in the text to be matched. The recorded index bit is also used to output matching information after the keyword is subsequently hit.
  • Step S104: when a record of the keyword to which the character belongs exists in the matching information set, obtaining that record and marking in it that the character is present in the text to be matched. Specifically, in step S104, if the character is not the first character of the keyword, the matching information set is searched to determine whether it contains a record of the keyword to which the character belongs. If it does, the record is obtained and the character is marked in it as present in the text to be matched; if it does not, the character is ignored and scanning continues with the next character in the text to be matched.
  • Determining whether a record of the keyword exists in the matching information set may be done by determining whether the set contains matching process information corresponding to the keyword to which the character belongs. If it does, that matching process information is obtained and the index bit of the character in the text to be matched is recorded in it; if it does not, the character is ignored and the next character is obtained from the text to be matched for matching.
  • The matching process information is in one-to-one correspondence with the keyword to which the character belongs, and the number of bits in each matching process information equals the number of characters in the corresponding keyword. Each bit marks whether the character at the corresponding position in the keyword has appeared in the text to be matched; if it has, the bit is set to the index bit of that character in the text to be matched.
  • When every bit of a matching process information has been set to such an index bit, every character of the corresponding keyword has appeared in the text to be matched, which indicates that the text to be matched hits the keyword.
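  • As a hedged illustration of the matching process information, the sketch below models it as a Python list with one slot per keyword character. The initial value -1 is the one used in this document's worked example; the function names `new_progress`, `mark` and `is_hit` are hypothetical.

```python
UNSET = -1  # initial value: the character has not yet been seen in the text


def new_progress(size):
    """Create matching process information with one unset slot per character."""
    return [UNSET] * size


def mark(progress, char_index, text_pos):
    """Record where the character at char_index of the keyword was found."""
    progress[char_index] = text_pos


def is_hit(progress):
    """The keyword is hit once no slot still holds the initial value."""
    return UNSET not in progress


p = new_progress(3)   # a keyword with three non-wildcard characters
mark(p, 0, 1)         # first character found at index bit 1 of the text
print(p, is_hit(p))
```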
  • This specific matching process will be described in detail in the subsequent embodiments.
  • The method may further include outputting matching information. Specifically, when the matching process information records the index bits, in the text to be matched, of all characters of the corresponding keyword, it is determined that the text to be matched hits the keyword; the matching information is then obtained from the text to be matched according to the index bit of each character and is output.
  • Using the matching process information, it can be determined whether each character of the keyword is present in the text to be matched, and at which index bit.
  • In step S104, after the matching process information corresponding to the keyword to which the character belongs is obtained, and before the index bit of the character in the text to be matched is recorded in it, the method further includes: determining whether an index bit for this character has already been recorded in the matching process information. If one has been recorded, the matching process information is copied, and the current index bit of the character in the text to be matched is recorded in the copy; if none has been recorded, the index bit is recorded in the original matching process information.
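  • The copy step above might look like the following sketch. It assumes matching process information is a list whose unset slots hold -1, and that all candidates for one keyword are kept in a list; `record_position` is a hypothetical helper, and the patent does not specify these data structures.

```python
UNSET = -1  # assumed initial value of an unfilled slot


def record_position(records, rec, char_index, text_pos):
    """Record text_pos for char_index; copy the record if the slot is already taken."""
    if rec[char_index] != UNSET:
        clone = list(rec)               # keep the earlier candidate untouched
        clone[char_index] = text_pos    # the copy carries the new index bit
        records.append(clone)
        return clone
    rec[char_index] = text_pos
    return rec


records = [[1, 4, -1]]
updated = record_position(records, records[0], 1, 6)  # slot 1 is already filled
print(records)
```

  • Keeping both the original and the copy lets either occurrence of a repeated character take part in a later match.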
  • the method further includes: constructing a character distance rule including the number of characters in the keyword and the effective distance between each character and the previous character for each keyword in the keyword set to form a character distance rule set;
  • Matching and verifying the keyword of the current character according to the character distance rule set includes: obtaining, from the matching process information corresponding to the keyword of the current character, the index bit in the text to be matched of the character preceding the current character, and calculating, from that index bit and the index bit of the current character in the text to be matched, a first distance between the current character and its previous character; obtaining, from the character distance rule corresponding to the keyword of the current character, a second distance between the current character and its previous character; if the second distance indicates an arbitrary interval length, or the interval length it indicates is not smaller than that indicated by the first distance, the verification succeeds, and the next character is obtained and matched against the keywords to which it belongs; if the second distance does not indicate an arbitrary interval length and the interval length it indicates is smaller than that indicated by the first distance, the verification fails, the matching process information is invalid, and matching of the keyword to which the character belongs ends.
  • this embodiment provides a specific example for description, as follows:
  • For each character of each keyword in the keyword set, a matching rule is constructed that includes the character, the keyword to which the character belongs, the number of characters included in that keyword, and the index bit of the character in that keyword. Each character of each keyword therefore has at least one matching rule, and the matching rules of all characters of all keywords in the keyword set constitute the matching rule set. An inverted index table from each distinct character to its matching rules is then constructed according to the matching rule set.
  • a character distance rule including the number of characters in the keyword and the effective distance between each character and the previous character is constructed to form a character distance rule set.
  • Each keyword corresponds to a character distance rule.
  • the number of characters refers to the number of non-wildcards included in the keyword.
  • the effective distance between each character and its previous character refers to the distance between each non-wildcard and its previous non-wildcard.
  • Each character can be continuous with its previous character.
  • the character distance can be set to 0 when continuous.
  • Each character and its previous character can also be discontinuous.
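  • The case of discontinuity can be divided into two types. In the first, the interval has arbitrary length: a wildcard "*" indicating an arbitrary length lies between the character and its previous character, and the character distance can be set to -1. In the second, the interval has a fixed length: n wildcards "?" lie between the character and its previous character, and the character distance can be set to n, where n is a natural number.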
  • the keyword can be verified by using the character distance rule.
  • the character distance rule set can also be generated when the keyword is matched and checked.
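  • A character distance rule of the form [number of characters, distance of character 2, distance of character 3, ...] can be derived from a keyword as sketched below. The function name `build_distance_rule`, the treatment of `?` as one position and `*` as arbitrary length, and the Latin-letter stand-in keywords are assumptions for illustration.

```python
ANY = -1  # distance value meaning "arbitrary interval length"


def build_distance_rule(keyword):
    """Return [char_count, d1, d2, ...] where d_i is the allowed gap before character i."""
    gaps = []
    count = 0
    gap = 0
    for ch in keyword:
        if ch == "*":
            gap = ANY                    # arbitrary length overrides any counted gap
        elif ch == "?":
            if gap != ANY:
                gap += 1                 # each "?" adds one position to the gap
        else:
            if count > 0:
                gaps.append(gap)         # gap between this character and the previous one
            count += 1
            gap = 0
    return [count] + gaps


print(build_distance_rule("g??o*iv"))
print(build_distance_rule("f??ga"))
```

  • With these stand-ins the two results have the same shape as the rules [4, 2, -1, 0] and [3, 2, 0] given for keyword one and keyword two in the example that follows.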
  • For example, the keyword set contains two keywords: keyword one is "generation??open*invoice" and keyword two is "find??proxy". In this translated example each quoted word stands for characters of the original keyword: keyword one has the four characters "generation", "open", "send" and "ticket" (the last two together forming "invoice"), and keyword two has three characters, "find" followed by the two characters of "proxy", the first of which is "generation".
  • A matching rule is constructed for each character in each keyword; the matching rule includes the character, the keyword to which the character belongs, the number of characters included in that keyword, and the index bit of the character in the keyword.
  • The matching rules of each character in keyword one and keyword two are constructed to form the matching rule set, shown in Table 1. The mapping between each character and its matching rules is then constructed and numbered to obtain the inverted index table, shown in Table 2.
  • Take the matching rule set shown in Table 1 as an example. The keyword "generation??open*invoice" includes 4 valid characters, so 4 matching rules are constructed for it.
  • First rule: the character (character) is "generation", the corresponding keyword (keyword) is "generation??open*invoice", the number of characters in the keyword (size) is 4, and the index bit of "generation" in the keyword (index) is 0.
  • Second rule: the character (character) is "open", the corresponding keyword (keyword) is "generation??open*invoice", the number of characters in the keyword (size) is 4, and the index bit of "open" in the keyword (index) is 1.
  • Third rule: the character (character) is "send", the corresponding keyword (keyword) is "generation??open*invoice", the number of characters in the keyword (size) is 4, and the index bit of "send" in the keyword (index) is 2.
  • Fourth rule: the character (character) is "ticket", the corresponding keyword (keyword) is "generation??open*invoice", the number of characters in the keyword (size) is 4, and the index bit of "ticket" in the keyword (index) is 3.
  • The character "generation" exists in both the keyword "generation??open*invoice" and the keyword "find??proxy", so it corresponds to two matching rules. The mapping between the character "generation" and these two matching rules is constructed, numbered, and stored in the inverted index table, as shown in Table 2.
  • The character "open" exists only in the keyword "generation??open*invoice", so it corresponds to one matching rule. The mapping between the character "open" and this matching rule is constructed, numbered, and stored in the inverted index table, as shown in Table 2.
  • For keyword one, the number of characters is four, and there are two wildcard characters "??" between the second character "open" and the first character "generation", that is, the character distance is 2.
  • The character distance rule constructed for keyword one is [4, 2, -1, 0].
  • The character distance rule constructed for keyword two is [3, 2, 0].
  • The character distance rule set therefore contains two character distance rules: [4, 2, -1, 0] for the keyword "generation??open*invoice" and [3, 2, 0] for the keyword "find??proxy".
  • Keyword fuzzy matching is then performed for each character in the text to be matched, as shown in FIG. 2, as follows:
  • Step 201: obtain a character from the text to be matched as the current character.
  • For example, the text to be matched is: "Looking for a cheap agent to open a business invoice". The text is scanned, and the character 'find' is retrieved as the current character.
  • Step 202 Obtain all matching rules corresponding to the current character from the inverted index table, and determine, for each matching rule, whether the matching process information set of the keyword to which the matching rule belongs is empty, and if the matching process information set is empty, Then, step 203 is performed; if the matching process information set is not empty, step 207 is performed.
  • If no matching rule corresponding to the current character is found in the inverted index table, the current character is ignored, the next character is obtained from the text to be matched and used as the current character, and step 202 is performed again.
  • Step 203 Determine whether the current character is the first character of the keyword, if yes, execute step 204; otherwise, execute step 212, that is, obtain the next character of the current character from the text to be matched, and use it as the current character, and then perform step 202.
  • the judgment is performed according to an index bit of the current character recorded in the matching rule in the keyword.
  • Step 204 Add a matching process information to the matching process information set, and record the index bit of the current character in the to-be-matched text into the matching process information, and perform step 205.
  • the matching process information is used to record the index bits of each character in a keyword in the text to be matched.
  • an initial value may be set for each element in the matching process information, and each element in the matching process information represents whether each character in the keyword appears in the text to be matched, and The index bit in the text to be matched.
  • the number of elements included in the matching process information is equal to the number of characters included in the corresponding keyword.
  • each element indicates whether each character in the keyword is included in the text to be matched. If not, the element corresponding to the character is an initial value, and if so, the element corresponding to the character is the character in the text to be matched. Index bit.
  • each element in the newly created matching process information is set to an initial value.
  • For example, the initial value of each element in newly created matching process information is set to -1, indicating that no character of the keyword has yet appeared in the text to be matched. The keyword "find??proxy" has three characters, so its newly created matching process information is [-1, -1, -1].
  • Step 205 Determine whether the matching is completed according to the matching process information. If the matching is completed, go to step 206. If the matching is not completed, execute step 212 to obtain the next character of the current character from the text to be matched, and use it as the current character. Step 202 is performed again.
  • the determining whether the matching is completed according to the matching process information may include: determining whether there is an element whose value is an initial value in the matching process information, and if yes, indicating that the matching is not completed, otherwise indicating that the matching is completed.
  • If a character of the keyword appears in the text to be matched, the element corresponding to that character in the matching process information of the keyword is set to the index bit of the character in the text to be matched; otherwise, the element retains the initial value. Therefore, when all elements of the matching process information are non-initial values, it can be determined that all characters of the keyword are included in the text to be matched, that is, the matching is completed.
  • Step 206 Output matching information according to the matching process information.
  • The characters between the two index bits in the text to be matched, including the characters at the two index bits themselves, are taken as the matching information, and the matching information is output.
  • Step 207 Determine whether the current character is the first character of the keyword, if yes, go to step 208; otherwise, go to step 209.
  • Step 208: add a new matching process information to the matching process information set, record the index bit of the current character in the text to be matched into it, and perform step 205.
  • Step 209: obtain from the matching process information set the matching process information corresponding to all keywords of the current character, record the index bit of the current character in the text to be matched into each of them, and perform step 210.
  • Before the index bit is recorded, the method further includes: determining whether an index bit for the current character has already been recorded in the corresponding matching process information. If so, the current matching process information is copied and the index bit of the current character in the text to be matched is recorded in the copy; otherwise, the index bit is recorded into the matching process information directly. Step 210 is then performed.
  • Step 210 Perform a distance check on each matching process information according to the character distance rule. If the verification result is successful, step 205 is performed; if the verification result is a failure, step 211 is performed.
  • step 211 the keyword is marked as invalid, that is, the matching of the keyword to which the current character belongs is ended.
  • Step 212 Obtain the next character of the current character from the text to be matched, and use it as the current character; then perform step 202.
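  • Steps 201 to 212 can be combined into one scan loop, sketched below under stated simplifications. The inverted index table and the character distance rules are passed in as plain dictionaries using Latin-letter stand-ins for the example keywords. The copy step for repeated characters is omitted. A character whose predecessor has not been recorded yet is skipped, and a gap of p2 - p1 - 1 between positions is assumed; the patent does not state either detail explicitly. The name `scan` is hypothetical.

```python
UNSET = ANY = -1  # -1 marks both an unfilled slot and an arbitrary-length gap


def scan(text, index, distance):
    """Return (keyword, matched span) pairs found while scanning text once."""
    active = {}  # keyword -> list of matching process information records
    hits = []
    for pos, ch in enumerate(text, start=1):       # 1-based, as in the worked example
        for kw, i in index.get(ch, []):            # step 202: rules for this character
            records = active.setdefault(kw, [])
            if i == 0:                             # steps 204/208: first character
                rec = [pos] + [UNSET] * (distance[kw][0] - 1)
                records.append(rec)
                touched = [rec]
            else:                                  # steps 209/210: later character
                touched = []
                for rec in list(records):
                    prev = rec[i - 1]
                    if prev == UNSET or prev >= pos or rec[i] != UNSET:
                        continue                   # simplification: skip, do not copy
                    allowed = distance[kw][i]
                    if allowed != ANY and allowed < pos - prev - 1:
                        records.remove(rec)        # step 211: distance check failed
                        continue
                    rec[i] = pos
                    touched.append(rec)
            for rec in touched:                    # steps 205/206: completion check
                if UNSET not in rec:
                    hits.append((kw, text[rec[0] - 1:rec[-1]]))
                    records.remove(rec)
    return hits


index = {
    "g": [("g??o*iv", 0), ("f??ga", 1)],
    "o": [("g??o*iv", 1)], "i": [("g??o*iv", 2)], "v": [("g??o*iv", 3)],
    "f": [("f??ga", 0)], "a": [("f??ga", 2)],
}
distance = {"g??o*iv": [4, 2, -1, 0], "f??ga": [3, 2, 0]}
print(scan("fxxgaoyyiv", index, distance))
```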
  • Whether the characters match is thus determined from two conditions: whether each character in the keyword has already appeared in the text to be matched, and whether the index bits of those characters in the text to be matched satisfy the preset relationship given by the character distance rule corresponding to the keyword.
  • The check can be implemented as follows: the character distance rule corresponding to the keyword to which the current character belongs is obtained; the index bit in the text to be matched of the character preceding the current character is obtained from the matching process information corresponding to that keyword; from that index bit and the index bit of the current character in the text to be matched, the first distance between the current character and its previous character is calculated; and the second distance between the current character and its previous character is obtained from the character distance rule corresponding to the keyword of the current character.
  • When the second distance indicates an arbitrary interval length, or the second distance is not smaller than the first distance, the check succeeds. It is then determined whether any element of the matching process information still holds the initial value: if so, the next character is obtained from the text to be matched and matched as the current character; otherwise, the match is completed. If the second distance does not indicate an arbitrary interval length and the interval length it indicates is smaller than that indicated by the first distance, the keyword match for the current character is invalid and matching of that keyword ends; the next character of the current character may then be obtained from the text to be matched to continue matching.
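  • The distance check can be sketched as a small function. It assumes that the first distance is the number of positions strictly between the two characters (so adjacent characters have distance 0, consistent with the rule value 0 for continuous characters) and that an unrecorded previous character counts as a failure; both are assumptions, as is the name `check_distance`.

```python
UNSET = -1  # initial value of an unfilled slot in the matching process information
ANY = -1    # rule entry meaning "arbitrary interval length"


def check_distance(rule, progress, char_index, text_pos):
    """Verify the gap between the current character and its previous character.

    rule is [char_count, d1, d2, ...]; rule[char_index] is the second distance.
    """
    if char_index == 0:
        return True                       # the first character has no previous character
    prev_pos = progress[char_index - 1]
    if prev_pos == UNSET:
        return False                      # assumption: previous character must be recorded
    first_distance = text_pos - prev_pos - 1
    second_distance = rule[char_index]
    return second_distance == ANY or second_distance >= first_distance


# Rule [3, 2, 0]: first character at index bit 1, second character seen at index bit 4.
print(check_distance([3, 2, 0], [1, -1, -1], 1, 4))
```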
  • Fuzzy matching is illustrated below with the text to be matched “Looking for a cheap agent to open a business invoice”. The text is scanned character by character, as follows:
  • The character 'find' is retrieved; its index bit in the text to be matched is 1. The corresponding matching rule 6 is found in the inverted index table, and the current character 'find' is the first character of the keyword “find the proxy” corresponding to matching rule 6.
  • A new matching process information record is created with process information number 1, and the initial value of the corresponding element in the matching process information for the keyword “find the proxy” is replaced with 1, the index bit of the current character 'find' in the text to be matched, as shown in Table 3:
  • The matching process information with process information number 1 is checked: in index[]: [1, -1, -1], only the first element is not -1, and the other elements all hold the initial value -1, so the match is not yet complete. The current character 'find' is the first character of its keyword “find the proxy” and has no corresponding character distance rule, so no check against the character distance rule is needed.
  • the character ‘Yes’ is retrieved, and the index bit in the text to be matched is 2, and the corresponding matching rule is not found from the inverted index table, and is ignored.
  • the character ‘Yes’ is retrieved, and the index bit in the text to be matched is 3, and the corresponding matching rule is not found from the inverted index table, and is ignored.
  • the character 'generation' is retrieved, and the index bit in the text to be matched is 4, and the corresponding matching rules 1 and 2 are found from the inverted index table.
  • Matching rule 1 is processed first: the current character 'generation' is the first character of the keyword “generation? invoice” corresponding to matching rule 1.
  • A new matching process information record is created with process information number 2, and the initial value of the corresponding element in matching process information 2 is replaced with 4, the index bit of the current character 'generation' in the text to be matched, as shown in Table 4:
  • The matching process information with process information number 2 is checked: in index[]: [4, -1, -1, -1], only the first element is not the initial value -1, and the remaining elements all hold the initial value -1, so the match is not yet complete.
  • Matching rule 2 is then processed: the current character 'generation' is not the first character of the keyword corresponding to matching rule 2, so no new matching process information is created.
  • The corresponding keyword “find the proxy” already has matching process information 1. Therefore, the initial value of the corresponding element in matching process information 1 is replaced with 4, the index bit of the current character 'generation' in the text to be matched.
  • The matching process information table is shown in Table 5:
  • For the keyword “find the proxy”, the character distance rule set is [3, 2, 0]. In this keyword, the character distance rule between 'find' and 'generation' is 2, which represents the allowed distance between the two characters; that is, the second distance equals 2. The first distance and the second distance therefore conform to the character distance rule of the character 'generation' in the keyword “find the proxy”.
  • The last element is still -1, so the match is not yet complete.
  • The character 'reason' is retrieved; its index bit in the text to be matched is 5, and the corresponding matching rule 7 is found in the inverted index table.
  • The current character 'reason' is not the first character of the keyword corresponding to matching rule 7, so no new matching process information is created.
  • The corresponding keyword “find the proxy” already has matching process information 1, so 5, the index bit of the current character 'reason' in the text to be matched, is written into matching process information 1. The updated matching process information table is shown in Table 6:
  • The character 'on' is retrieved; its index bit in the text to be matched is 6, and the corresponding matching rule 3 is found in the inverted index table.
  • The current character 'on' is not the first character of the keyword corresponding to matching rule 3, so no new matching process information is created.
  • The corresponding keyword “generation? invoice” already has matching process information 2. Therefore, 6, the index bit of the current character 'on' in the text to be matched, is written into matching process information 2. The updated matching process information table is shown in Table 7:
  • The character 'battalion' is retrieved; its index bit in the text to be matched is 7. No corresponding matching rule is found in the inverted index table, so the character is ignored.
  • The character 'industry' is retrieved; its index bit in the text to be matched is 8. No corresponding matching rule is found in the inverted index table, so the character is ignored.
  • The character 'send' is retrieved; its index bit in the text to be matched is 9, and the corresponding matching rule 4 is found in the inverted index table.
  • The current character 'send' is not the first character of the keyword corresponding to matching rule 4, so no new matching process information is created.
  • The corresponding keyword “generation? invoice” already has matching process information 2. Therefore, 9, the index bit of the current character 'send' in the text to be matched, is written into matching process information 2. The updated matching process information table is shown in Table 8:
  • The character distance rule between 'on' and 'send' in this keyword is -1, indicating that the distance between the two characters, that is, the second distance, is an arbitrary interval, so the addition is successful. The last element is still -1, so the match is not yet complete.
  • The character 'ticket' is retrieved; its index bit in the text to be matched is 10, and the corresponding matching rule 5 is found in the inverted index table.
  • The current character 'ticket' is not the first character of the keyword corresponding to matching rule 5, so no new matching process information is created.
  • The corresponding keyword “generation? invoice” already has matching process information 2. Therefore, 10, the index bit of the current character 'ticket' in the text to be matched, is written into matching process information 2. The updated matching process information table is shown in Table 9:
  • The character distance rule between 'on' and 'send' is -1, indicating that the distance between the two characters, that is, the second distance, is an arbitrary interval; the addition is therefore successful and conforms to the keyword's character distance rule.
  • No element of the matching process information is -1, so the match is complete, and the character string “proxy opening business invoice” is hit, located by the first and last elements of the matching process information.
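The walk-through above can be replayed with a short Python sketch. It is an illustration under stated assumptions rather than the patent's implementation. The English glosses stand in for the original characters, and each keyword is written as a list of tokens in which `?` skips one character and `*` allows any interval. The layout of the second keyword (`generation`, `?`, `on`, `*`, `send`, `ticket`) is inferred from the walk-through, and the names `compile_keyword` and `scan` are hypothetical. The sketch keeps a single matching process per keyword and accepts a gap when the second distance is not smaller than the first distance.

```python
ANY = -1      # distance rule value meaning "any interval length"
INITIAL = -1  # initial value of every element in a matching process array


def compile_keyword(pattern):
    """Split a pattern into literal tokens and the allowed gap before each one."""
    literals, gaps, gap = [], [], 0
    for token in pattern:
        if token == "?":
            gap = gap if gap == ANY else gap + 1
        elif token == "*":
            gap = ANY
        else:
            literals.append(token)
            gaps.append(gap)
            gap = 0
    return literals, gaps


def scan(text, patterns):
    """Return (pattern number, index array) for every keyword hit in text."""
    compiled = [compile_keyword(p) for p in patterns]
    inverted = {}  # token -> [(pattern number, position in keyword)]
    for number, (literals, _) in enumerate(compiled):
        for position, token in enumerate(literals):
            inverted.setdefault(token, []).append((number, position))
    processes = {}  # pattern number -> index array being filled
    hits = []
    for index, token in enumerate(text, start=1):  # 1-based, as in the walk-through
        for number, position in inverted.get(token, []):
            literals, gaps = compiled[number]
            if position == 0:
                # First character: create new matching process information.
                processes[number] = [index] + [INITIAL] * (len(literals) - 1)
            else:
                process = processes.get(number)
                if process is None or process[position - 1] == INITIAL:
                    continue  # no usable matching process: ignore
                first_distance = index - process[position - 1] - 1
                second_distance = gaps[position]
                if second_distance != ANY and second_distance < first_distance:
                    del processes[number]  # distance check failed: match invalid
                    continue
                process[position] = index
            if INITIAL not in processes[number]:
                hits.append((number, list(processes[number])))
                del processes[number]
    return hits


text = ["find", "Yes", "Yes", "generation", "reason",
        "on", "battalion", "industry", "send", "ticket"]
patterns = [["find", "?", "?", "generation", "reason"],
            ["generation", "?", "on", "*", "send", "ticket"]]
hits = scan(text, patterns)
print(hits)
```

Under these assumptions the sketch reports the index arrays [1, 4, 5] for the first keyword and [4, 6, 9, 10] for the second; the second array is the one reached at the end of the walk-through, and its first and last elements delimit the hit string.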
  • For example, the keyword is “generation??* invoice”,
  • and the text to be matched is “Beijing agent opens a business to open a tax invoice”.
  • The matching process information is updated: the second element of the matching process information array is set to 4, so the array becomes [2, 4, -1, -1].
  • The second element of the array in the existing matching process information already holds a value (not the initial value -1). In this case a new copy of the matching process information is made, and the array of the copy is [2, 7, -1, -1].
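The copy step in this example can be sketched as follows. This is a hedged illustration only: the helper name `record_index` is hypothetical, and resetting the elements after the updated position in the copy is an assumption (in the example they are still -1 anyway). The sketch also skips duplicate copies, which the source does not discuss.

```python
INITIAL = -1  # initial value of an unfilled element


def record_index(processes, slot, text_index):
    """Record text_index at position slot of every matching process array.

    If the slot is still empty it is filled in place; if it already holds a
    value, the array is copied and the copy receives the new index bit.
    """
    copies = []
    for process in processes:
        if process[slot] == INITIAL:
            process[slot] = text_index
        else:
            # Assumption: elements after the slot restart from the initial value.
            clone = process[:slot] + [text_index] + [INITIAL] * (len(process) - slot - 1)
            if clone not in processes and clone not in copies:
                copies.append(clone)
    processes.extend(copies)
    return processes


processes = [[2, INITIAL, INITIAL, INITIAL]]
record_index(processes, 1, 4)
print(processes)  # [[2, 4, -1, -1]]
record_index(processes, 1, 7)
print(processes)  # [[2, 4, -1, -1], [2, 7, -1, -1]]
```

Keeping both arrays lets either occurrence of the repeated character lead to a complete match later in the scan.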
  • A character in the text to be matched may belong to a keyword without being the first character of that keyword; for example, the keywords are “generation??* invoice” and “open* ticket”,
  • and the text to be matched is “Beijing Opens a Tax Invoice”.
  • In this embodiment, a matching rule is established for each character and an inverted index table is created.
  • Keywords whose first character does not appear in the text to be matched are filtered out, so the keywords do not all have to be checked exhaustively, which simplifies keyword matching.
  • This embodiment provides a device for multi-keyword fuzzy matching. As shown in FIG. 3, the device includes: an obtaining module 301, a determining module 302, a first marking module 303, a second marking module 304, and a determining module 305.
  • the obtaining module 301 is configured to obtain, for each character in the text to be matched and according to the keyword set, the keyword to which the character belongs and the index bit of the character in that keyword;
  • the determining module 302 is configured to determine, according to the index bit of the character in the keyword to which it belongs, whether the character is the first character of that keyword;
  • the first marking module 303 is configured to: when the determination result of the determining module 302 is yes, record the keyword to which the character belongs in the matching information set, and mark in the record that the first character of the keyword exists in the text to be matched;
  • the second marking module 304 is configured to: when the determination result of the determining module 302 is negative and a record of the keyword to which the character belongs exists in the matching information set, obtain the record of that keyword, and mark in the record that the character of the keyword exists in the text to be matched;
  • the determining module 305 is configured to determine that the to-be-matched text hits a keyword when every character of the keyword is marked as existing in the to-be-matched text.
  • the apparatus may further include: a matching rule building module, a matching rule set building module, and an inverted index building module.
  • the matching rule construction module is configured to construct, for each character of each keyword in the keyword set, a matching rule corresponding to that character; the matching rule includes: the character, the keyword to which the character belongs, the number of characters included in that keyword, and the index bit of the character in that keyword;
  • the matching rule set construction module is configured to acquire the matching rule corresponding to each character in a keyword, and form the matching rule set corresponding to the keyword;
  • the inverted index construction module is configured to construct, according to the matching rule set, an inverted index table from characters to matching rules; the inverted index table includes: a character, and all matching rules corresponding to the character.
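The matching rules and the inverted index table described by these modules might be built as in the following Python sketch. The four fields of `MatchingRule` follow the list given in the text; everything else is an assumption made for illustration: the names `MatchingRule` and `build_inverted_index`, the choice of `?` and `*` as wildcard symbols that are not indexed, 0-based index bits, and the ASCII stand-in keywords.

```python
from collections import defaultdict, namedtuple

WILDCARDS = {"?", "*"}  # assumed wildcard symbols; they get no matching rule

# One rule per keyword character: the character, its keyword, the number of
# characters in that keyword, and the character's index bit in the keyword.
MatchingRule = namedtuple("MatchingRule", "char keyword length index")


def build_inverted_index(keywords):
    """Map each character to all matching rules that mention it."""
    table = defaultdict(list)
    for keyword in keywords:
        literals = [c for c in keyword if c not in WILDCARDS]
        for position, char in enumerate(literals):
            table[char].append(MatchingRule(char, keyword, len(literals), position))
    return table


table = build_inverted_index(["a??bc", "b?d*ef"])  # stand-in keywords
for rule in table["b"]:
    print(rule)
```

A lookup such as `table["b"]` returns one rule per keyword containing that character, so a scanned character that is absent from the table can be ignored immediately.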
  • the obtaining module 301 may include: a traversal unit and a first acquiring unit.
  • the traversing unit is configured to traverse the inverted index table, and when the inverted index table includes the character, acquire all matching rules corresponding to the character;
  • the first obtaining unit is configured to acquire, for each matching rule, a keyword of the character included in the matching rule, and an index bit of the character in the associated keyword.
  • the first marking module 303 may include: a first recording unit.
  • the first recording unit is configured to: when the determination result of the determining module 302 is yes, create matching process information corresponding to the keyword to which the character belongs, save the matching process information in the matching process information set, and record the index bit of the character in the text to be matched in the matching process information.
  • the second marking module 304 may include: a second recording unit.
  • the second recording unit is configured to: when the determination result of the determining module 302 is negative, search the matching process information set and determine whether it contains matching process information corresponding to the keyword to which the character belongs; if so, acquire that matching process information and record the index bit of the character in the to-be-matched text in it.
  • the second recording unit may further include: a determining subunit, a copy updating subunit, and an index bit recording subunit.
  • the determining subunit is configured to determine whether an index bit of the character in the to-be-matched text has been recorded in the matching process information corresponding to the keyword to which the character belongs;
  • the copy update subunit is configured to: when the judgment result of the judgment subunit is yes, copy the matching process information corresponding to the keyword to which the character belongs, and in the copy replace the recorded index bit of the character in the to-be-matched text with the current index bit of the character in the to-be-matched text;
  • the index bit recording subunit is configured to: when the determination result of the determining subunit is negative, perform the step of acquiring the matching process information corresponding to the keyword to which the character belongs and recording the index bit of the character in the to-be-matched text in the matching process information.
  • Every character in the keyword being marked as existing in the to-be-matched text means that each element of the matching process information corresponding to the keyword has been set to the index bit, in the text to be matched, of the character at the corresponding position.
  • the apparatus may further include: an output module.
  • the output module is configured to: after the determining module 305 determines that the to-be-matched text hits the keyword, obtain matching information from the to-be-matched text according to the index bit of each character of the keyword in the text to be matched, and output the matching information.
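As a rough sketch of what the output step could look like, the snippet below extracts the hit string from the first and last recorded index bits. The function names `is_hit` and `extract_hit`, the 0-based index bits, and the sample text are assumptions made for illustration.

```python
INITIAL = -1  # assumed initial value of an unfilled element


def is_hit(process):
    """A keyword is hit once no element still holds the initial value."""
    return INITIAL not in process


def extract_hit(text, process):
    """Return the substring spanned by the first and last recorded index bits."""
    if not is_hit(process):
        return None
    first, last = process[0], process[-1]
    return text[first:last + 1]  # 0-based index bits, last one inclusive


text = "zzbxdyyef"  # stand-in text to be matched
print(extract_hit(text, [2, 4, 7, 8]))    # bxdyyef
print(extract_hit(text, [2, 4, -1, -1]))  # None
```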
  • the apparatus may further include: a character distance construction module and a matching verification module.
  • the character distance construction module is configured to construct a character distance rule for each keyword in the keyword set, forming a character distance rule set, where the character distance rule includes: the number of characters included in the keyword, and the effective distance between each character in the keyword and its preceding character;
  • the matching check module is configured to perform, while the text to be matched is being matched, a distance check on the keyword to which the current character belongs according to the character distance rule set.
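A character distance rule of the kind described here could be derived from a keyword pattern as in the sketch below. The layout of the result (character count first, then one distance per character after the first) follows the rule set [3, 2, 0] quoted in the worked example. The function name `build_distance_rule`, the reading of `?` as one skipped character and `*` as an arbitrary interval stored as -1, and the ASCII stand-in keywords are assumptions.

```python
ANY_INTERVAL = -1  # assumed value for "any interval length"


def build_distance_rule(keyword):
    """Return [number of characters, distance to previous character, ...]."""
    literal_count = 0
    distances = []
    gap = 0  # wildcards seen since the previous literal character
    for symbol in keyword:
        if symbol == "?":
            if gap != ANY_INTERVAL:
                gap += 1
        elif symbol == "*":
            gap = ANY_INTERVAL
        else:
            if literal_count > 0:
                distances.append(gap)  # the first character has no predecessor
            literal_count += 1
            gap = 0
    return [literal_count] + distances


print(build_distance_rule("a??bc"))   # [3, 2, 0]
print(build_distance_rule("b?d*ef"))  # [4, 1, -1, 0]
```

The first pattern has the same shape as the three-character keyword in the worked example and yields the same rule set.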
  • the matching verification module may include: a second obtaining unit, a third obtaining unit, a first checking unit, and a second checking unit.
  • the second obtaining unit is configured to obtain, for the keyword to which the current character belongs, the index bit in the to-be-matched text of the character preceding the current character, and to calculate the first distance between the current character and its preceding character from that index bit and the index bit of the current character in the to-be-matched text;
  • the third obtaining unit is configured to obtain, as the second distance, the effective distance between the current character and its preceding character from the character distance rule corresponding to the keyword to which the current character belongs;
  • the first checking unit is configured to: if the second distance represents an arbitrary interval length, or the second distance is greater than the first distance, determine that the distance check succeeds and acquire the next character after the current character for matching;
  • the second checking unit is configured to: if the second distance does not represent an arbitrary interval length and the second distance is smaller than the first distance, determine that the distance check fails, invalidate the matching process information, and end the matching of the keyword to which the character belongs.
  • FIG. 4 is a block diagram of an apparatus 800 for keyword fuzzy matching, according to an exemplary embodiment.
  • device 800 can be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a gaming console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
  • apparatus 800 can include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, And a communication component 816.
  • Processing component 802 typically controls the overall operation of device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
  • Processing component 802 can include one or more processors 820 to execute instructions to perform all or part of the steps of the above described methods.
  • processing component 802 can include one or more modules to facilitate interaction between component 802 and other components.
  • processing component 802 can include a multimedia module to facilitate interaction between multimedia component 808 and processing component 802.
  • Memory 804 is configured to store various types of data to support operation at device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phone book data, messages, pictures, videos, and the like.
  • the memory 804 can be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), erasable programmable read only memory (EPROM), programmable read only memory (PROM), read only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • Power component 806 provides power to various components of device 800.
  • Power component 806 can include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for device 800.
  • the multimedia component 808 includes a screen that provides an output interface between the device 800 and the user.
  • the screen can include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen can be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may sense not only the boundary of the touch or sliding action, but also the duration and pressure associated with the touch or slide operation.
  • the multimedia component 808 includes a front camera and/or a rear camera. When the device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front and rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.
  • the audio component 810 is configured to output and/or input an audio signal.
  • the audio component 810 includes a microphone (MIC) that is configured to receive external audio signals when the device 800 is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode.
  • the received audio signal may be further stored in memory 804 or transmitted via communication component 816.
  • the audio component 810 also includes a speaker for outputting an audio signal.
  • the I/O interface 812 provides an interface between the processing component 802 and the peripheral interface module, which may be a keyboard, a click wheel, a button, or the like. These buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.
  • Sensor assembly 814 includes one or more sensors for providing device 800 with a status assessment of various aspects.
  • sensor assembly 814 can detect an open/closed state of device 800 and the relative positioning of components, such as the display and keypad of device 800; sensor assembly 814 can also detect a change in position of device 800 or of one component of device 800, the presence or absence of user contact with device 800, the orientation or acceleration/deceleration of device 800, and temperature changes of device 800.
  • Sensor assembly 814 can include a proximity sensor configured to detect the presence of nearby objects without any physical contact.
  • Sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor assembly 814 can also include an acceleration sensor, a gyro sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • Communication component 816 is configured to facilitate wired or wireless communication between device 800 and other devices.
  • the device 800 can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof.
  • the communication component 816 receives broadcast signals or broadcast associated information from an external broadcast management system via a broadcast channel.
  • the communication component 816 also includes a near field communication (NFC) module to facilitate short range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • device 800 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above methods.
  • There is also provided a non-transitory computer readable storage medium comprising instructions, such as a memory 804 comprising instructions, which are executable by processor 820 of apparatus 800 to perform the above method.
  • the non-transitory computer readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, or an optical data storage device.
  • a non-transitory computer readable storage medium, wherein when instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to perform a method for keyword fuzzy matching, the method comprising: for each character in the text to be matched, acquiring, according to the keyword set, the keyword to which the character belongs and the index bit of the character in that keyword; determining, according to the index bit of the character in the keyword, whether the character is the first character of the keyword; and if the character is the first character of the keyword, recording the keyword to which the character belongs in the matching information set and marking the first character of the keyword in the record.
  • FIG. 5 is a schematic structural diagram of a server in an embodiment of the present invention.
  • the server 1900 can vary considerably depending on configuration or performance, and can include one or more central processing units (CPUs) 1922 (e.g., one or more processors), memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) storing application programs 1942 or data 1944.
  • the memory 1932 and the storage medium 1930 may be short-term storage or persistent storage.
  • the program stored on storage medium 1930 may include one or more modules (not shown), each of which may include a series of instruction operations in the server.
  • central processor 1922 can be configured to communicate with storage medium 1930 and to execute, on the server 1900, the series of instruction operations in storage medium 1930.
  • An embodiment of the present invention further provides a program including readable code which, when run on a server, causes the server to perform the keyword fuzzy matching method according to any one of the embodiments of the present invention.
  • An embodiment of the present invention further provides a readable medium in which the program described in the embodiments of the present invention is stored.
  • FIG. 6 shows a server that can implement the keyword fuzzy matching method according to the present invention.
  • the server conventionally includes a processor 1610 and a program product or readable medium in the form of a memory 1620.
  • the memory 1620 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), an EPROM, or a ROM.
  • Memory 1620 has a memory space 1630 for program code 1631 for performing any of the method steps described above.
  • storage space 1630 for program code may include various program code 1631 for implementing various steps in the above methods, respectively.
  • These program codes can be read from or written to one or more program products.
  • These program products include program code carriers such as memory cards.
  • Such a program product is typically a portable or fixed storage unit as described with reference to FIG.
  • the storage unit may have a storage section, a storage space, and the like arranged similarly to the storage 1620 in the server of FIG.
  • the program code can be compressed, for example, in an appropriate form.
  • the storage unit includes readable code 1631', i.e., code that can be read by, for example, a processor such as 1610, which when executed by the server causes the server to perform various steps in the methods described above.
  • Server 1900 may also include one or more power sources 1926, one or more wired or wireless network interfaces 1950, one or more input and output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, for example, Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.

Abstract

A fuzzy keyword matching method and apparatus. The method comprises: acquiring a keyword to which each character in a to-be-matched text pertains, and an index bit of the character in the pertaining keyword according to a keyword set respectively (101); judging the index bit of the character in the pertaining keyword (102); if the character is a first character, recording the pertaining keyword of the character, and marking the first character in the keyword into the to-be-matched text (103); and if the character is not the first character and when recorded keywords exist, searching the recorded keywords for the pertaining keyword of the character, and marking the character in the keyword into the to-be-matched text (104). When each character in a keyword is marked into the to-be-matched text, it is determined that the to-be-matched text hits the keyword. The method and apparatus can overcome the defect in the prior art of low keyword matching efficiency to a certain extent.

Description

Method and device for keyword fuzzy matching
The present application claims full priority to the invention patent application filed in China with application No. 201510921094.0, filing date December 11, 2015, and title “A method and device for keyword fuzzy matching”.
Technical field
The present invention relates to the field of Internet technologies, and in particular to a method and device for keyword fuzzy matching.
Background
With the rapid development of Internet technology, network communication has also developed rapidly. It is spreading with irresistible momentum into the politics, economy, thought, culture, and many other fields of countries around the world, changing people's lives and the face of society. However, network communication inevitably brings a series of negative effects, such as ideological and cultural infiltration, the spread of fake news and false information, and violations of personal privacy.
Managing network communication usually involves fields such as text content analysis, network public opinion monitoring, intrusion detection, and information retrieval, and keyword fuzzy matching technology is widely used in these fields.
In the prior art, keyword fuzzy matching usually relies on regular expressions, that is, a single pattern string is used to describe and match a series of strings that conform to a given rule. A regular-expression matching engine generally compiles the regular expression into a nondeterministic finite automaton (NFA) or a deterministic finite automaton (DFA) and then scans and matches the text. Regular expressions are usually verbose, keyword matching with them is inefficient, and the matching process is time-consuming, which wastes network resources.
Summary of the invention
The present invention provides a method for keyword fuzzy matching that can, to a certain extent, overcome the low efficiency of existing keyword matching.
The present invention provides the following solutions:
A method for keyword fuzzy matching, comprising: for each character in a text to be matched, obtaining, according to a keyword set, the keyword to which the character belongs and the index bit of the character in that keyword; determining, according to the index bit of the character in the keyword, whether the character is the first character of the keyword; if the character is the first character of the keyword to which it belongs, recording the keyword in a matching information set, and marking in the record that the first character of the keyword exists in the text to be matched; if the character is not the first character of the keyword to which it belongs and a record of that keyword exists in the matching information set, obtaining the record of the keyword and marking in the record that the character of the keyword exists in the text to be matched; and when every character of a keyword is marked as existing in the text to be matched, determining that the text to be matched hits the keyword.
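The basic method stated above (without the character distance rules of the detailed embodiment) can be sketched in a few lines of Python. This is a simplified illustration under assumptions: the name `match_keywords` is hypothetical, keywords are plain strings without wildcards, each keyword has a single record, and the marks are stored as booleans rather than index bits.

```python
def match_keywords(text, keywords):
    """Return the keywords whose characters are all marked as present in text."""
    # character -> [(keyword, index bit of the character in that keyword)]
    index = {}
    for keyword in keywords:
        for position, char in enumerate(keyword):
            index.setdefault(char, []).append((keyword, position))

    records = {}  # matching information set: keyword -> per-character marks
    hits = []
    for char in text:
        for keyword, position in index.get(char, []):
            if position == 0:
                # First character: record the keyword and mark its first character.
                records[keyword] = [True] + [False] * (len(keyword) - 1)
            elif keyword in records:
                # Non-first character with an existing record: mark it.
                records[keyword][position] = True
            else:
                continue  # no record for this keyword yet: ignore the character
            if all(records[keyword]) and keyword not in hits:
                hits.append(keyword)
    return hits


print(match_keywords("xaybzc", ["abc", "bd"]))  # ['abc']
```

Because a record is only created when a keyword's first character is seen, keywords whose first character never occurs in the text cost nothing beyond the index lookup.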
A fuzzy keyword matching apparatus, comprising: an obtaining module, configured to obtain, for each character in a text to be matched and according to a keyword set, the keyword to which the character belongs and the index position of the character in that keyword; a judging module, configured to determine, according to the index position of the character in the keyword, whether the character is the first character of the keyword; a first marking module, configured to, when the result of the judging module is yes, record the keyword in a matching information set and mark in the record that the first character of the keyword is present in the text to be matched; a second marking module, configured to, when the result of the judging module is no and the matching information set contains a record of the keyword, obtain that record and mark in it that the character of the keyword is present in the text to be matched; and a determining module, configured to determine that the text to be matched hits a keyword when every character of that keyword has been marked as present in the text to be matched.
An apparatus for fuzzy keyword matching, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs containing instructions for: for each character in a text to be matched, obtaining, according to a keyword set, the keyword to which the character belongs and the index position of the character in that keyword; determining, according to the index position of the character in the keyword, whether the character is the first character of the keyword; if the character is the first character of the keyword, recording the keyword in a matching information set and marking in the record that the first character of the keyword is present in the text to be matched; if the character is not the first character of the keyword and the matching information set contains a record of the keyword, obtaining that record and marking in it that the character of the keyword is present in the text to be matched; and when every character of a keyword has been marked as present in the text to be matched, determining that the text to be matched hits the keyword.
A program comprising readable code which, when run on a server, causes the server to perform the fuzzy keyword matching method according to any one of the embodiments of the present invention.
A readable medium storing the program described in the embodiments of the present invention.
According to the specific embodiments provided herein, the present invention discloses the following technical effects:
With the fuzzy keyword matching method of the present invention, for each character of the obtained text to be matched, the keyword to which the character belongs is obtained and the index position of the character in that keyword is determined. If the character is the first character, the keyword is recorded and its first character is marked as present in the text to be matched. If it is not the first character, the keyword is looked up among the keywords already recorded and the character is marked as present in the text to be matched. When every character of a keyword has been marked as present in the text to be matched, the keyword is hit by the text. Matching character by character in this way filters out keywords whose first character does not appear in the text to be matched, so there is no need to try every keyword exhaustively; keywords are hit quickly, which saves time and improves efficiency.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a fuzzy keyword matching method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for performing fuzzy keyword matching on each character of a text to be matched according to an embodiment of the present invention;
FIG. 3 is a block diagram of a multi-keyword fuzzy matching apparatus according to an embodiment of the present invention;
FIG. 4 is a block diagram of an apparatus for fuzzy keyword matching according to an exemplary embodiment;
FIG. 5 is a schematic structural diagram of a server in an embodiment of the present invention;
FIG. 6 is a block diagram of a server for performing the multi-keyword fuzzy matching method according to the present invention;
FIG. 7 shows a storage unit for holding or carrying program code that implements the multi-keyword fuzzy matching method according to the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments fall within the scope of protection of the present invention.
The present invention provides a fuzzy keyword matching method which, as shown in FIG. 1, comprises:
Step S101: for each character in the text to be matched, obtain, according to the keyword set, the keyword to which the character belongs and the index position of the character in that keyword;
Specifically, the text to be matched is scanned; for each character scanned, the keyword to which the character belongs is obtained from the keyword set, together with the index position of the character in that keyword. A character in the text to be matched may correspond to one keyword, to several keywords, or to none.
This embodiment further includes: for each character of each keyword in the keyword set, constructing a matching rule for that character, where a matching rule contains the character, the keyword to which the character belongs, the number of characters in that keyword, and the index position of the character in that keyword; collecting the matching rules of all characters of a keyword to form the matching rule set for that keyword; and building, from the matching rule sets, an inverted index table from characters to matching rule sets. The inverted index table contains each character together with all matching rules corresponding to it; as follows from the above, the matching rules corresponding to a character are the matching rules that contain that character.
When any character of the text to be matched is scanned, the keyword to which the character belongs and the index position of the character in that keyword are obtained from the inverted index table. Specifically, for each character in the text to be matched, the inverted index table is traversed; if the table contains the character, all matching rules corresponding to the character are obtained, and for each matching rule the keyword to which the character belongs and the index position of the character in that keyword are read from the rule.
In the embodiments of the present application, each keyword is split into its characters, a matching rule is established for each character, and an inverted index table is built. Traversing the inverted index table then quickly yields all keywords to which a character belongs and the index position of the character in each of them. The keywords no longer need to be matched one by one, which makes the matching process simpler, faster, and more efficient.
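The index construction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the names `MatchRule`, `build_inverted_index`, and `WILDCARDS` are hypothetical, and the wildcard symbols `?` and `*` are taken from the example keywords used later in this description.

```python
from collections import defaultdict
from typing import NamedTuple

# Assumed wildcard symbols; wildcards never get a matching rule.
WILDCARDS = {"?", "*"}


class MatchRule(NamedTuple):
    character: str  # the non-wildcard character
    keyword: str    # keyword the character belongs to
    size: int       # number of non-wildcard characters in the keyword
    index: int      # position of the character among those characters


def build_inverted_index(keywords):
    """Map each character to every matching rule that contains it."""
    index = defaultdict(list)
    for keyword in keywords:
        chars = [c for c in keyword if c not in WILDCARDS]
        for position, char in enumerate(chars):
            index[char].append(MatchRule(char, keyword, len(chars), position))
    return dict(index)


inverted = build_inverted_index(["代??开*发票", "找??代理"])
for rule in inverted["代"]:
    print(rule)
```

A single dictionary lookup per scanned character then returns every keyword that contains the character, together with its index position in each keyword.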
A keyword may contain wildcards and non-wildcards. In this embodiment, non-wildcards are uniformly referred to as characters. A keyword may contain one or several wildcards, which may be consecutive or separated, and a wildcard may span any number of characters. In this embodiment, keywords can be added to, modified in, and deleted from the keyword set; correspondingly, whenever such an operation is performed, the content of the inverted index table is updated to reflect it.
Step S102: determine, according to the index position of the character in the keyword, whether the character is the first character of the keyword; if it is, perform step S103; if it is not, perform step S104;
Specifically, the characters of a keyword are divided into the first character and non-first characters. The first character is the first non-wildcard in the keyword; non-first characters are the remaining non-wildcards. When a keyword contains only one non-wildcard, that character is the first character.
Step S103: record the keyword to which the character belongs in the matching information set, and mark in the record that the first character of the keyword is present in the text to be matched;
Specifically, in step S103, if the character is the first character of its keyword, matching process information corresponding to the keyword is newly created and saved in the matching information set, and the index position of the character in the text to be matched is recorded in that matching process information. Matching process information corresponds one-to-one with the keyword to which the character belongs.
Further, recording the index position of the character in the text to be matched in the matching process information both marks that this character of the keyword is present in the text to be matched and stores its index position in the text, for use when outputting matching information after the keyword is subsequently hit.
Step S104: when the matching information set contains a record of the keyword to which the character belongs, obtain that record and mark in it that the character of the keyword is present in the text to be matched. Specifically, in step S104, if the character is not the first character of its keyword, the matching information set is searched to determine whether it contains a record of the keyword. If it does, the record is obtained and the character of the keyword is marked in it as present in the text to be matched; if it does not, the character is ignored and scanning continues with the next character of the text to be matched.
In the embodiments of the present application, determining in step S104 whether the matching information set contains a record of the keyword may consist of determining whether the matching information set contains matching process information corresponding to the keyword. If it does, that matching process information is obtained and the index position of the character in the text to be matched is recorded in it; if it does not, the character is ignored.
Specifically, when the character is determined to be a non-first character, whether the keyword has already been recorded is decided by checking whether the matching information set contains matching process information corresponding to the keyword.
If such matching process information exists, the keyword has already been recorded. The record of the keyword, which may specifically be the matching process information corresponding to the keyword, is obtained, and the index position of the character in the text to be matched is recorded in it.
If no such matching process information exists, the keyword has not been recorded; since the character is not the first character of the keyword either, the character can be ignored and the next character is taken from the text to be matched.
It should be noted that matching process information corresponds one-to-one with the keyword to which the character belongs, and each piece of matching process information has as many positions as the corresponding keyword has characters. Each position marks whether the character at the corresponding position of the keyword has appeared in the text to be matched; if it has, the position is set to the index position of that character in the text to be matched.
Hence, when every position of a piece of matching process information has been set to the index position of the corresponding character in the text to be matched, every character of the corresponding keyword has appeared in the text to be matched, which means the keyword is hit by the text. The detailed matching process is described in subsequent embodiments.
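The matching process information described above can be pictured as a small record with one slot per non-wildcard character. The sketch below is only illustrative: the class name `MatchProgress`, its method names, and the use of `-1` as the "not yet seen" value are assumptions, not details given in this description.

```python
UNSET = -1  # assumed initial value meaning "not yet seen in the text"


class MatchProgress:
    """Per-keyword record with one slot per non-wildcard character."""

    def __init__(self, keyword, size):
        self.keyword = keyword
        self.slots = [UNSET] * size

    def record(self, char_index, text_index):
        # Store where the keyword's character was found in the text.
        self.slots[char_index] = text_index

    def is_hit(self):
        # The keyword is hit once every slot holds a text position.
        return all(slot != UNSET for slot in self.slots)


progress = MatchProgress("找??代理", 3)
progress.record(0, 1)  # first character found at text position 1
progress.record(1, 4)
print(progress.is_hit())  # False: the third character is still missing
progress.record(2, 5)
print(progress.is_hit())  # True: every character has been seen
```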
Further, when the keyword is determined to be hit by the text to be matched, the method may also include outputting matching information. Specifically, once the matching process information records the index positions in the text to be matched of all characters of the corresponding keyword, the text is determined to hit the keyword; the matching information can then be obtained from the text according to those index positions and output.
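The description does not specify what the matching information contains. One plausible reading, shown here purely as an assumption, is the span of text between the first and last recorded index positions; the function name `matched_span` is hypothetical.

```python
def matched_span(text, positions):
    """Return the slice of text covered by the recorded index positions."""
    start, end = min(positions), max(positions)
    return text[start:end + 1]


# Positions recorded for the three characters of a keyword that was hit.
print(matched_span("寻找便宜代理开营业发票", [1, 4, 5]))
```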
In the embodiments of the present application, the matching process information makes it possible to determine whether each character of a keyword is present in the text to be matched, and at which index position.
In this embodiment, when every character of a keyword has been marked as present in the text to be matched, the text is determined to hit the keyword. Specifically, when the matching process information corresponding to the keyword records the index positions in the text to be matched of all characters of the keyword, the text is determined to hit the keyword.
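Steps S101 to S104 and the hit test can be combined into one scanning loop. The following Python sketch is a simplified illustration under stated assumptions: the function names are hypothetical, `?` and `*` are assumed to be the wildcard symbols, and the character distance verification and record copying described elsewhere in this embodiment are deliberately left out, so keywords with repeated characters are not handled correctly.

```python
WILDCARDS = {"?", "*"}  # assumed wildcard symbols


def build_index(keywords):
    """Character -> list of (keyword, size, index position in keyword)."""
    index = {}
    for keyword in keywords:
        chars = [c for c in keyword if c not in WILDCARDS]
        for position, char in enumerate(chars):
            index.setdefault(char, []).append((keyword, len(chars), position))
    return index


def scan(text, keywords):
    """Return the keywords hit by text, in the order they are completed."""
    index = build_index(keywords)
    records = {}  # keyword -> text positions per character (None = unseen)
    hits = []
    for text_pos, char in enumerate(text):                   # S101
        for keyword, size, position in index.get(char, []):
            if position == 0:                                # S102 -> S103
                record = records.setdefault(keyword, [None] * size)
            elif keyword in records:                         # S102 -> S104
                record = records[keyword]
            else:
                continue  # keyword not started yet: ignore this character
            record[position] = text_pos
            if None not in record and keyword not in hits:
                hits.append(keyword)
    return hits


print(scan("寻找便宜代理开营业发票", ["代??开*发票", "找??代理"]))
# -> ['找??代理', '代??开*发票']
```

Characters that appear in no keyword cost a single failed dictionary lookup, and a keyword whose first character never appears is never given a record.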
Optionally, in this embodiment, step S104 further includes, after obtaining the matching process information corresponding to the keyword and before recording the index position of the character in it: determining whether the matching process information already records an index position for this character. If it does, the matching process information is copied, and in the copy the recorded index position of the character is replaced with the character's current index position in the text to be matched; if it does not, the step of recording the index position of the character in the matching process information is performed.
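The copy-on-repeat behaviour can be sketched as below. How the copies are stored is not stated in this description; keeping a list of records per keyword, the function name `record_with_copy`, and the use of `None` for an unset slot are all assumptions made for illustration.

```python
def record_with_copy(records, char_index, text_index):
    """Record text_index for one character in every record of a keyword.

    records is a list of slot lists for a single keyword. A record whose
    slot is still unset is updated in place; a record whose slot is already
    set is copied, and only the copy receives the new position.
    """
    copies = []
    for slots in records:
        if slots[char_index] is None:
            slots[char_index] = text_index
        else:
            clone = list(slots)
            clone[char_index] = text_index
            copies.append(clone)
    records.extend(copies)
    return records


records = [[4, 6, None, None]]
record_with_copy(records, 1, 8)  # second character seen again at position 8
print(records)
```

Keeping both the original and the copy preserves the earlier candidate alignment in case the later occurrence does not lead to a hit.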
This embodiment further includes: for each keyword in the keyword set, constructing a character distance rule containing the number of characters in the keyword and the valid distance between each character and its preceding character, thereby forming a character distance rule set. When the text to be matched is being matched, the keyword to which the current character belongs is verified against the character distance rule set, which specifically includes: obtaining, for the keyword to which the current character belongs, the index position in the text to be matched of the character preceding the current character, and computing from that index position and the current character's index position in the text a first distance between the current character and its preceding character; obtaining, from the character distance rule of the keyword, a second distance between the current character and its preceding character; if the second distance denotes an arbitrary interval length, or the interval length it denotes is greater than that denoted by the first distance, the verification succeeds and the current character is added, and the next character after the current character is obtained and matched against its keyword; if the second distance does not denote an arbitrary interval length and the interval length it denotes is smaller than that denoted by the first distance, the verification fails, the matching process information becomes invalid, and matching of the keyword ends.
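The distance verification can be sketched as a small predicate. Several details here are assumptions rather than statements of the description: the first distance is taken to be the number of text characters lying between the two positions, an arbitrary-length interval is encoded as `-1` (as in the example later in this description), and the case where both distances are equal, which the text leaves open, is treated as a pass. The name `check_distance` is hypothetical.

```python
ANY_LENGTH = -1  # rule distance assumed to encode an arbitrary-length gap


def check_distance(prev_text_index, cur_text_index, rule_distance):
    """Return True if the observed gap satisfies the keyword's rule distance.

    The text is scanned left to right, so cur_text_index is assumed to be
    greater than prev_text_index.
    """
    # First distance: characters between the two positions in the text.
    observed = cur_text_index - prev_text_index - 1
    if rule_distance == ANY_LENGTH:
        return True
    # Fails only when the rule allows fewer characters than were observed.
    return rule_distance >= observed


print(check_distance(4, 6, 2))   # one character in between, rule allows 2
print(check_distance(1, 5, 2))   # three characters in between, rule allows 2
print(check_distance(6, 9, -1))  # arbitrary-length gap
```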
In addition, building on the above technical solution, this embodiment provides a concrete example, as follows:
For each character of each keyword in the keyword set, a matching rule can be constructed in advance containing the character, the keyword to which it belongs, the number of characters in that keyword, and the index position of the character in that keyword, forming a matching rule set; an inverted index table from characters to matching rule sets is then built from the matching rule set.
Specifically, for each character of each keyword in the keyword set, matching rules are constructed containing the character, every keyword to which it belongs, the number of characters in each such keyword, and the index position of the character in each such keyword. Every character of every keyword in the keyword set has at least one matching rule. The matching rules of all characters of all keywords in the keyword set make up the matching rule set, and for every distinct character an inverted index entry from that character to its matching rules is built.
For each keyword in the keyword set, a character distance rule is constructed containing the number of characters in the keyword and the valid distance between each character and its preceding character, forming a character distance rule set. Each keyword corresponds to one character distance rule. Here, the number of characters is the number of non-wildcards in the keyword, and the valid distance between a character and its preceding character is the distance between a non-wildcard and the non-wildcard before it.
A character and its preceding character may be contiguous, in which case the character distance can be set to 0.
A character and its preceding character may also be non-contiguous. There are two cases. In the first, an arbitrary-length interval, the wildcard "*" denoting any length lies between the character and its preceding character, and the character distance can be set to -1. In the second, a fixed-length interval, n wildcards "?", each denoting one character, lie between the character and its preceding character, and the character distance can be set to n (n being a natural number); for example, if the wildcards "??" lie between the character and its preceding character, the character distance is 2.
In this embodiment, keywords can be verified using the character distance rules. The character distance rule set may also be generated at the time a keyword is verified during matching.
For example, the keyword set contains two keywords: keyword one is "代??开*发票" and keyword two is "找??代理".
First, a matching rule is constructed for each character of each keyword; the rule contains the character, the keyword to which it belongs, the number of characters in that keyword, and the index position of the character in that keyword. The matching rules of the characters in keyword one and keyword two are constructed and combined into the matching rule set shown in Table 1. The mapping between each character and its matching rules is then built and numbered, giving the inverted index table shown in Table 2.
Figure PCTCN2016104693-appb-000001
Table 1
The matching rule set is illustrated below with reference to Table 1. The keyword "代??开*发票", for example, contains 4 valid characters, so 4 matching rules are constructed.
The first character (character) is "代": the keyword (keyword) is "代??开*发票", the number of characters in the keyword (size) is 4, and the index position (index) of "代" in the keyword is 0.
The second character (character) is "开": the keyword (keyword) is "代??开*发票", the number of characters in the keyword (size) is 4, and the index position (index) of "开" in the keyword is 1.
The third character (character) is "发": the keyword (keyword) is "代??开*发票", the number of characters in the keyword (size) is 4, and the index position (index) of "发" in the keyword is 2.
The fourth character (character) is "票": the keyword (keyword) is "代??开*发票", the number of characters in the keyword (size) is 4, and the index position (index) of "票" in the keyword is 3.
Figure PCTCN2016104693-appb-000002
Table 2
The inverted index table is illustrated below with reference to Table 2.
For example, the character "代" occurs in both the keyword "代??开*发票" and the keyword "找??代理" and therefore corresponds to two matching rules. The mappings between the character "代" and these two matching rules are built, numbered, and stored in the inverted index table, as shown in Table 2.
As another example, the character "开" occurs only in the keyword "代??开*发票" and therefore corresponds to one matching rule. The mapping between the character "开" and that matching rule is built, numbered, and stored in the inverted index table, as shown in Table 2.
Further, the character distance rule described in the foregoing embodiments of the present application is illustrated.
For keyword one: 代??开*发票
The number of characters is 4. Between the second character "开" and the first character "代" there are two wildcards "??", so the character distance is 2. Between the third character "发" and the second character "开" there is the arbitrary-length wildcard "*", so the character distance is -1. Between the fourth character "票" and the third character "发" there is no wildcard, so the character distance is 0. The character distance rule constructed for keyword one is therefore [4, 2, -1, 0].
For keyword two: 找??代理
Following the same method as for keyword one, the character distance rule constructed for keyword two is [3, 2, 0].
The character distance rule set now contains two rules: [4, 2, -1, 0] for keyword one "代??开*发票" and [3, 2, 0] for keyword two "找??代理".
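The two rules above can be reproduced mechanically. The Python sketch below is one way to derive a character distance rule from a keyword; the function name `build_distance_rule` is hypothetical, and the treatment of a gap that mixes "?" and "*" (counted as arbitrary length) is an assumption, since the description does not cover that case.

```python
def build_distance_rule(keyword):
    """Return [character count, d1, d2, ...] for a keyword.

    Each d is the distance between a non-wildcard and the one before it:
    0 if contiguous, n for n '?' wildcards, -1 if a '*' lies in between.
    """
    distances = []
    gap = 0    # distance accumulated since the previous non-wildcard
    count = 0  # number of non-wildcards seen so far
    for char in keyword:
        if char == "*":
            gap = -1              # arbitrary-length interval
        elif char == "?":
            if gap != -1:         # assumed: '*' dominates a mixed gap
                gap += 1
        else:
            if count > 0:         # the first character has no predecessor
                distances.append(gap)
            count += 1
            gap = 0
    return [count] + distances


print(build_distance_rule("代??开*发票"))  # [4, 2, -1, 0]
print(build_distance_rule("找??代理"))     # [3, 2, 0]
```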
When keyword fuzzy matching needs to be performed on a text to be matched, it is performed for each character of the text to be matched in turn, as shown in FIG. 2 and described below:
Step 201: Obtain a character from the text to be matched as the current character.
For example, the text to be matched is "寻找便宜代理开营业发票" ("looking for a cheap agent to issue business invoices"). The text is scanned as follows: the character '寻' is retrieved and taken as the current character.
Step 202: Obtain from the inverted index table all matching rules corresponding to the current character. For each matching rule, determine whether the matching process information set of the keyword to which the matching rule belongs is empty. If the set is empty, perform step 203; if it is not empty, perform step 207.
In this embodiment, if no matching rule corresponding to the current character can be obtained from the inverted index table, the current character is ignored, the next character is obtained from the text to be matched and taken as the current character, and step 202 is performed.
Step 203: Determine whether the current character is the first character of the keyword. If so, perform step 204; otherwise perform step 212, that is, obtain the next character from the text to be matched, take it as the current character, and perform step 202 again.
Specifically, the determination is made according to the index position, recorded in the matching rule, of the current character within the keyword to which it belongs.
Step 204: Add a new piece of matching process information to the matching process information set, record in it the index position of the current character in the text to be matched, and perform step 205.
In this embodiment, the matching process information records the index position, in the text to be matched, of each character of a keyword. When matching process information is created, an initial value may be set for each of its elements. Each element indicates whether the corresponding character of the keyword has appeared in the text to be matched and, if so, at which index position.
Specifically, the number of elements in the matching process information equals the number of characters in the corresponding keyword. Each element indicates whether the corresponding character of the keyword is contained in the text to be matched: if not, the element holds the initial value; if so, the element holds the index position of that character in the text to be matched.
For example, when the scanned character of the text to be matched is "找", it is determined that "找" is the first character of the keyword "找??代理" to which it belongs. A new piece of matching process information is then created. Each piece of matching process information corresponds to one keyword, has as many elements as the keyword has characters, and has all elements set to the initial value when created. For example, the initial value of each element may be -1, indicating that none of the characters of the keyword has yet appeared in the text to be matched. The keyword "找??代理" has three characters, so matching process information with three elements is created and each element is set to -1, giving index[-1,-1,-1]. The index position of the scanned character "找" in the text to be matched is 1, so the element corresponding to "找" is updated to 1, and the updated matching process information is index[1,-1,-1].
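The creation of matching process information in step 204 can be illustrated with a short Python sketch. The names `INITIAL` and `start_match` are hypothetical, and the use of a plain list for the matching process information is an assumption made for illustration.

```python
INITIAL = -1  # initial value used in the example above


def start_match(char_count, text_index):
    """Create matching process information for a keyword with
    `char_count` characters and record where its first character
    was found in the text to be matched."""
    info = [INITIAL] * char_count
    info[0] = text_index
    return info


text = "寻找便宜代理开营业发票"
info = start_match(3, text.index("找"))  # keyword "找??代理" has 3 characters
print(info)  # [1, -1, -1]
```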
Step 205: Determine from the matching process information whether matching is complete. If it is complete, perform step 206; if not, perform step 212, that is, obtain the next character from the text to be matched, take it as the current character, and perform step 202 again.
In step 205, determining from the matching process information whether matching is complete may specifically include: determining whether the matching process information contains an element whose value is the initial value. If it does, matching is not complete; otherwise matching is complete.
It should be noted that, as described above, when the text to be matched contains a character of a keyword, the element corresponding to that character in the keyword's matching process information is set to the index position of the character in the text to be matched; otherwise the element keeps its initial value. Therefore, when every element of the matching process information holds a non-initial value, it can be determined that the text to be matched contains all characters of the keyword, that is, matching is complete.
Step 206: Output matching information according to the matching process information.
Specifically, according to the index positions in the text to be matched of the first character and the last character recorded in the matching process information, the characters between these two index positions, including the characters at the two positions themselves, are obtained from the text to be matched and output as the matching information.
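Steps 205 and 206 can be sketched in Python as follows. The function names `is_complete` and `extract_match` are hypothetical, and the matching process information is assumed to be a list of index positions with -1 as the initial value.

```python
INITIAL = -1  # assumed initial value of every element


def is_complete(info):
    """Step 205: matching is complete when no element still holds
    the initial value."""
    return INITIAL not in info


def extract_match(text, info):
    """Step 206: return the characters from the index position of the
    first keyword character to that of the last one, inclusive."""
    return text[info[0]:info[-1] + 1]
```

For instance, with the text "寻找便宜代理开营业发票" and the matching process information [1,4,5], `extract_match` returns "找便宜代理".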
Step 207: Determine whether the current character is the first character of the keyword to which it belongs. If so, perform step 208; otherwise perform step 209.
Step 208: Add a new piece of matching process information to the matching process information set, record in it the index position of the current character in the text to be matched, and perform step 205.
Step 209: Obtain from the matching process information set the matching process information corresponding to every keyword to which the current character belongs, record the index position of the current character in the text to be matched in each piece of matching process information, and perform step 210.
In this embodiment, after the matching process information corresponding to every keyword to which the current character belongs is obtained from the matching process information set, and before the index position of the current character in the text to be matched is recorded in each piece of matching process information, the method further includes: determining whether an index position in the text to be matched has already been recorded for the current character in the corresponding matching process information. If so, the current matching process information is copied and the index position of the current character in the text to be matched is written into the copy; otherwise the index position is recorded in each piece of matching process information. Step 210 is then performed.
Step 210: Perform a distance check on each piece of matching process information according to the character distance rule. If the check succeeds, perform step 205; if it fails, perform step 211.
Step 211: Mark the keyword as invalid, that is, end the matching of the keyword to which the current character belongs.
Step 212: Obtain the next character from the text to be matched and take it as the current character; then perform step 202.
In this embodiment, whether characters match is determined by checking whether every character of a keyword has appeared in the text to be matched and whether the index positions of the keyword's characters in the text to be matched satisfy the keyword's character distance rule. Three cases can arise. In the first case, the keyword containing the current character still has unmatched characters, and matching must continue with the other characters. In the second case, all other characters of the keyword containing the current character have been matched, and the distance in the keyword between the current character and its preceding character is greater than or equal to the distance between the same two characters in the text to be matched; the keyword is then determined to be valid and hit. In the third case, all other characters of the keyword containing the current character have been matched, but the distance in the keyword between the current character and its preceding character is smaller than the distance between the same two characters in the text to be matched; the keyword is then determined to be invalid.
This can be implemented as follows. Obtain the character distance rule of the keyword to which the current character belongs. From the matching process information of that keyword, obtain the index position in the text to be matched of the character that precedes the current character in the keyword. From that index position and the index position of the current character in the text to be matched, compute the first distance between the current character and its preceding character. From the character distance rule, obtain the second distance between the current character and its preceding character. When the second distance denotes an interval of any length, or is greater than or equal to the first distance, determine whether the matching process information contains an element whose value is the initial value. If it does, the character has been added successfully, and the next character is obtained from the text to be matched and matched as the current character; otherwise matching is complete. If the second distance does not denote an interval of any length and the interval length it denotes is smaller than that denoted by the first distance, the matching of the keyword to which the current character belongs has failed; matching of that keyword ends, and the next character may be obtained from the text to be matched for matching.
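The distance check of step 210 can be sketched in Python as follows. The function name `check_distance` is hypothetical. The sketch assumes that the character distance rule has the form [character count, d1, d2, ...], that -1 denotes an interval of any length, and that `pos` is the zero-based position (at least 1) of the current character in its keyword. Treating an unrecorded preceding character as a failed check is an additional assumption that the text does not state.

```python
ANY_LENGTH = -1  # second distance that denotes an interval of any length
INITIAL = -1     # initial value in the matching process information


def check_distance(info, rule, pos):
    """Return True if the character at keyword position `pos` satisfies
    the character distance rule relative to its preceding character."""
    prev_index = info[pos - 1]
    if prev_index == INITIAL:
        return False  # assumption: the preceding character must be recorded
    first = info[pos] - prev_index - 1  # first distance, measured in the text
    second = rule[pos]                  # second distance, taken from the rule
    return second == ANY_LENGTH or second >= first
```

For the keyword "找??代理" with rule [3,2,0], `check_distance([1, 4, -1], [3, 2, 0], 1)` yields True because the first distance 4-1-1=2 does not exceed the second distance 2.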
The fuzzy matching of the text to be matched "寻找便宜代理开营业发票" ("looking for a cheap agent to issue business invoices") under the above embodiment is described below as an example:
The text to be matched is "寻找便宜代理开营业发票". The text is scanned as follows:
The character '寻' is retrieved; its index position in the text to be matched is 0.
No corresponding matching rule is found in the inverted index table, so the character is ignored.
The character '找' is retrieved; its index position in the text to be matched is 1.
The corresponding matching rule 6 is found in the inverted index table, and the current character '找' is the first character of the keyword "找??代理" to which matching rule 6 corresponds. New matching process information is created with process information number 1, and the initial value of the element corresponding to "找" in it is replaced with 1, the index position of "找" in the text to be matched, as shown in Table 3:
Keyword | Matching process information | Process information number
找??代理 | index[]: [1,-1,-1] | 1
Table 3
The matching process information with process information number 1 is checked. In index[]: [1,-1,-1], only the first position holds a value other than -1; the other positions hold the initial value -1, so matching is not yet complete. The current character "找" is the first character of its keyword "找??代理" and has no corresponding character distance rule, so no check against the character distance rule is needed.
The character '便' is retrieved; its index position in the text to be matched is 2. No corresponding matching rule is found in the inverted index table, so the character is ignored.
The character '宜' is retrieved; its index position in the text to be matched is 3. No corresponding matching rule is found in the inverted index table, so the character is ignored.
The character '代' is retrieved; its index position in the text to be matched is 4. The corresponding matching rules 1 and 2 are found in the inverted index table.
Matching rule 1 is processed. The current character '代' is the first character of the keyword "代??开*发票" to which matching rule 1 corresponds. New matching process information is created with process information number 2, and the initial value of the corresponding element in matching process information 2 is replaced with 4, the index position of "代" in the text to be matched, as shown in Table 4:
Keyword | Matching process information | Process information number
找??代理 | index[]: [1,-1,-1] | 1
代??开*发票 | index[]: [4,-1,-1,-1] | 2
Table 4
The matching process information with process information number 2 is checked. In index[]: [4,-1,-1,-1], only the first position holds a value other than the initial value -1; the remaining three positions hold the initial value -1, so matching is not yet complete.
Matching rule 2 is processed. The current character '代' is not the first character of the keyword to which matching rule 2 corresponds, so no new matching process information is created. Its keyword "找??代理" already has matching process information 1, so the initial value of the corresponding element in matching process information 1 is replaced with 4, the index position of "代" in the text to be matched. The updated matching process information table is shown in Table 5:
Keyword | Matching process information | Process information number
找??代理 | index[]: [1,4,-1] | 1
代??开*发票 | index[]: [4,-1,-1,-1] | 2
Table 5
Matching process information 1 is checked. The first distance, computed from the first and second positions of index[]: [1,4,-1], is 4-1-1=2. For the keyword "找??代理" corresponding to matching process information 1, the character distance rule is [3,2,0]; the character distance between '找' and '代' in the keyword is 2, that is, the second distance equals 2. The first distance therefore satisfies the character distance rule for the character '代' in the keyword "找??代理". However, the last position is still -1, so matching is not yet complete.
The character '理' is retrieved; its index position in the text to be matched is 5.
The corresponding matching rule 7 is found in the inverted index table. The current character '理' is not the first character of the keyword to which matching rule 7 corresponds, so no new matching process information is created. Its keyword "找??代理" already has matching process information 1, so 5, the index position of "理" in the text to be matched, is written into matching process information 1. The updated matching process information table is shown in Table 6:
Keyword | Matching process information | Process information number
找??代理 | index[]: [1,4,5] | 1
代??开*发票 | index[]: [4,-1,-1,-1] | 2
Table 6
Matching process information 1 is checked. The first distance between the first and second positions of index[] is 4-1-1=2. For the keyword "找??代理" corresponding to matching process information 1, the character distance rule is [3,2,0]; the character distance between '找' and '代' in the keyword is 2, that is, the second distance equals 2, so the distance rule for the character '代' in the keyword "找??代理" is satisfied. In the same way, the distance between the second and third positions is 5-4-1=0, which satisfies the distance rule for the character '理' in the keyword "找??代理", so the distance check passes. Moreover, no position of the matching process information is -1, so matching succeeds, and the hit string "找便宜代理" is output according to the first and last positions of the matching process information.
The character '开' is retrieved; its index position in the text to be matched is 6.
The corresponding matching rule 3 is found in the inverted index table. The current character '开' is not the first character of the keyword to which matching rule 3 corresponds, so no new matching process information is created. Its keyword "代??开*发票" already has matching process information 2, so 6, the index position of "开" in the text to be matched, is written into matching process information 2. The updated matching process information table is shown in Table 7:
Keyword | Matching process information | Process information number
找??代理 | index[]: [1,4,5] | 1
代??开*发票 | index[]: [4,6,-1,-1] | 2
Table 7
Matching process information 2 is checked. The first distance between the first and second positions of index[] is 6-4-1=1. For the keyword "代??开*发票" corresponding to matching process information 2, the character distance rule is [4,2,-1,0]; the character distance between '代' and '开' in the keyword is 2, that is, the second distance equals 2. Since the first distance does not exceed the second distance, the distance rule for the character '开' in the keyword "代??开*发票" is satisfied. However, the third and fourth positions are still -1, so matching is not yet complete.
The character '营' is retrieved; its index position in the text to be matched is 7. No corresponding matching rule is found in the inverted index table, so the character is ignored.
The character '业' is retrieved; its index position in the text to be matched is 8. No corresponding matching rule is found in the inverted index table, so the character is ignored.
The character '发' is retrieved; its index position in the text to be matched is 9. The corresponding matching rule 4 is found in the inverted index table. The current character '发' is not the first character of the keyword to which matching rule 4 corresponds, so no new matching process information is created. Its keyword "代??开*发票" already has matching process information 2, so 9, the index position of "发" in the text to be matched, is written into matching process information 2. The updated matching process information table is shown in Table 8:
Keyword | Matching process information | Process information number
找??代理 | index[]: [1,4,5] | 1
代??开*发票 | index[]: [4,6,9,-1] | 2
Table 8
Matching process information 2 is checked. The first distance between the first and second positions of index[] is 6-4-1=1. For the keyword "代??开*发票" corresponding to matching process information 2, the character distance rule is [4,2,-1,0]; the character distance between '代' and '开' in the keyword is 2, that is, the second distance equals 2, so the distance rule for the character '开' in the keyword "代??开*发票" is satisfied. The first distance between the second and third positions is 9-6-1=2; the character distance between '开' and '发' in the keyword is -1, that is, the second distance denotes an interval of any length, so the character has been added successfully. The last position is still -1, so matching is not yet complete.
The character '票' is retrieved; its index position in the text to be matched is 10. The corresponding matching rule 5 is found in the inverted index table. The current character '票' is not the first character of the keyword to which matching rule 5 corresponds, so no new matching process information is created. Its keyword "代??开*发票" already has matching process information 2, so 10, the index position of "票" in the text to be matched, is written into matching process information 2. The updated matching process information table is shown in Table 9:
Keyword | Matching process information | Process information number
找??代理 | index[]: [1,4,5] | 1
代??开*发票 | index[]: [4,6,9,10] | 2
Table 9
Matching process information 2 is checked. The first distance between the first and second positions of index[] is 6-4-1=1. For the keyword "代??开*发票" corresponding to matching process information 2, the character distance rule is [4,2,-1,0]; the character distance between '代' and '开' in the keyword is 2, that is, the second distance equals 2, so the distance rule for the character '开' in the keyword "代??开*发票" is satisfied. The first distance between the second and third positions is 9-6-1=2; the character distance between '开' and '发' in the keyword is -1, that is, the second distance denotes an interval of any length, so the distance rule for the character '发' in the keyword "代??开*发票" is satisfied. The first distance between the third and fourth positions is 10-9-1=0, which satisfies the distance rule for the character '票' in the keyword "代??开*发票". Moreover, no position of the matching process information is -1, so matching is complete, and the hit string "代理开营业发票" is output according to the first and last positions of the matching process information.
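The walkthrough above can be reproduced with the following Python sketch. It is a simplified illustration under stated assumptions, not the implementation of the embodiment: the names `parse` and `fuzzy_match` are hypothetical, every extension of a match creates a new copy of the matching process information rather than updating it in place, and a match whose distance check fails is simply dropped.

```python
def parse(pattern):
    """Split a wildcard keyword into its characters and the allowed gap
    before each character (-1 means any length)."""
    chars, gaps, gap = [], [], 0
    for ch in pattern:
        if ch == "?":
            gap = gap + 1 if gap != -1 else -1
        elif ch == "*":
            gap = -1
        else:
            if chars:
                gaps.append(gap)
            chars.append(ch)
            gap = 0
    return chars, [0] + gaps  # gaps[pos] applies to chars[pos]


def fuzzy_match(text, patterns):
    keywords = [parse(p) for p in patterns]
    inverted = {}  # character -> list of (keyword id, position in keyword)
    for kid, (chars, _) in enumerate(keywords):
        for pos, ch in enumerate(chars):
            inverted.setdefault(ch, []).append((kid, pos))

    active, hits = [], []  # active: (keyword id, matching process info)
    for i, ch in enumerate(text):
        fresh, dead = [], set()
        for kid, pos in inverted.get(ch, []):
            chars, gaps = keywords[kid]
            if pos == 0:  # first character: start new process information
                fresh.append((kid, [i] + [-1] * (len(chars) - 1)))
                continue
            for akid, info in active:
                if akid != kid or info[pos - 1] == -1 or info[pos] != -1:
                    continue
                first = i - info[pos - 1] - 1  # distance in the text
                if gaps[pos] == -1 or first <= gaps[pos]:
                    fresh.append((kid, info[:pos] + [i] + info[pos + 1:]))
                else:
                    dead.add(id(info))  # distance check failed
        active = [c for c in active if id(c[1]) not in dead]
        for kid, info in fresh:
            if -1 in info:
                active.append((kid, info))
            else:  # every character recorded: output the hit string
                hits.append(text[info[0]:info[-1] + 1])
    return hits


print(fuzzy_match("寻找便宜代理开营业发票", ["代??开*发票", "找??代理"]))
# ['找便宜代理', '代理开营业发票']
```

Copying instead of updating in place keeps the sketch short; it also means that unfinished matches are retained until a distance check removes them.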
In addition, the text to be matched may contain several occurrences of the same character, where that character belongs to a keyword. The case in which the text to be matched contains two identical characters that belong to a keyword is described below as an example:
For example, the keyword is "代??开*发票" and the text to be matched is "北京代理开营业开个税发票".
When the first '开' in the text to be matched is scanned, the matching process information is updated: the second position of the matching process information array is set to 4, giving [2,4,-1,-1]. When the second '开' in the text to be matched is scanned, the second position of the array in the existing matching process information is found to already hold a value (not the initial value -1). A new copy of the matching process information is then made, in which the array is [2,7,-1,-1].
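The copy-on-repeat behaviour described above can be sketched in Python as follows. The function name `record_index` is hypothetical, and skipping a copy that already exists in the set is an assumption added to avoid duplicates.

```python
INITIAL = -1  # initial value of every element


def record_index(infos, pos, text_index):
    """Record `text_index` at keyword position `pos` in every piece of
    matching process information. If the position already holds a value,
    copy the information and write the new index into the copy."""
    for info in list(infos):  # iterate over a snapshot; infos may grow
        if info[pos] == INITIAL:
            info[pos] = text_index
        else:
            clone = info.copy()
            clone[pos] = text_index
            if clone not in infos:  # assumption: avoid duplicate copies
                infos.append(clone)


text = "北京代理开营业开个税发票"
infos = [[text.index("代"), -1, -1, -1]]   # [[2, -1, -1, -1]]
record_index(infos, 1, text.index("开"))     # first '开' at index 4
record_index(infos, 1, text.index("开", 5))  # second '开' at index 7
print(infos)  # [[2, 4, -1, -1], [2, 7, -1, -1]]
```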
In addition, a character in the text to be matched may belong to a keyword without being that keyword's first character. For example, the keywords are "代??开*发票" and "开*票", and the text to be matched is "北京开个税发票".
When '开' in the text to be matched is scanned, it is found not to be the first character of its keyword "代??开*发票", so that keyword is ignored. Because '开' is the first character of its keyword "开*票", the characters after '开' in the text to be matched are matched only against the keyword "开*票".
In this embodiment, keywords are split into characters, a matching rule is established for each character, and an inverted index table is built. When the text to be matched is matched, keywords whose first character does not appear in the text to be matched are filtered out, so not all keywords need to be enumerated and keyword matching becomes simpler. Checking the matching process information avoids backtracking during keyword matching and avoids the problem of invalid keywords, which improves matching efficiency.
This embodiment provides an apparatus for multi-keyword fuzzy matching. As shown in FIG. 3, the apparatus includes an obtaining module 301, a judging module 302, a first marking module 303, a second marking module 304, and a determining module 305.
The obtaining module 301 is configured to obtain, for each character in the text to be matched and according to a keyword set, the keyword to which the character belongs and the index position of the character in that keyword.
The judging module 302 is configured to judge, according to the index position of the character in the keyword to which it belongs, whether the character is the first character of that keyword.
The first marking module 303 is configured to, when the judgment result of the judging module is yes, record the keyword to which the character belongs in a matching information set, and mark in the record that the first character of the keyword is present in the text to be matched.
The second marking module 304 is configured to, when the judgment result of the judging module is no and the matching information set contains a record of the keyword to which the character belongs, obtain that record and mark in it that the character of the keyword is present in the text to be matched.
The determining module 305 is configured to determine that the text to be matched hits a keyword when every character of that keyword has been marked as present in the text to be matched.
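The cooperation of modules 301 to 305 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation of the embodiment: the function name `match_keywords`, the use of a dictionary as the matching information set, and the choice to restart a record whenever the first character of a keyword reappears are all assumptions made for this example. The sketch also ignores the order of non-first characters and the distance check described later.

```python
def match_keywords(text, keywords):
    """Return (keyword, positions) for every keyword hit in text.

    Hypothetical helper: names and data layout are assumptions.
    """
    # Obtaining step: map each character to (keyword, index in keyword).
    rules = {}
    for kw in keywords:
        for i, ch in enumerate(kw):
            rules.setdefault(ch, []).append((kw, i))

    records = {}  # matching information set: keyword -> text positions
    hits = []
    for pos, ch in enumerate(text):
        for kw, i in rules.get(ch, []):
            if i == 0:
                # First character: create a record and mark slot 0.
                records[kw] = [None] * len(kw)
                records[kw][0] = pos
            elif kw in records:
                # Non-first character with an existing record: mark it.
                records[kw][i] = pos
            else:
                # No record yet, so the first character was never seen.
                continue
            rec = records[kw]
            if all(p is not None for p in rec):
                # Every character is marked: the text hits the keyword.
                hits.append((kw, tuple(rec)))
                del records[kw]
    return hits


if __name__ == "__main__":
    print(match_keywords("xaybzc", ["abc", "bd"]))
```

Because a record is only created when a first character is seen, keywords whose first character never occurs in the text are never tracked, which is the filtering effect described above.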
In a preferred embodiment of the present application, the apparatus may further include a matching rule construction module, a matching rule set construction module, and an inverted index construction module.
The matching rule construction module is configured to construct, for each character of each keyword in the keyword set, a matching rule corresponding to that character. The matching rule contains: the character, the keyword to which the character belongs, the number of characters in that keyword, and the index position of the character in that keyword.
The matching rule set construction module is configured to obtain the matching rule corresponding to each character of a keyword and form the matching rule set corresponding to that keyword.
The inverted index construction module is configured to construct, according to the matching rule sets, an inverted index table from characters to matching rules. The inverted index table includes: each character, and all matching rules corresponding to that character.
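A minimal sketch of the three construction modules follows. The four fields of `MatchRule` mirror the contents of a matching rule listed above; the names `MatchRule` and `build_inverted_index` and the use of a plain dictionary for the inverted index table are assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class MatchRule(NamedTuple):
    char: str      # the character
    keyword: str   # keyword to which the character belongs
    length: int    # number of characters in that keyword
    index: int     # index position of the character in the keyword


def build_inverted_index(keywords):
    """Build a character -> [MatchRule, ...] table (hypothetical helper)."""
    table = defaultdict(list)
    for kw in keywords:
        # One rule per character; together they form the rule set of kw.
        for i, ch in enumerate(kw):
            table[ch].append(MatchRule(ch, kw, len(kw), i))
    return dict(table)


if __name__ == "__main__":
    for ch, rules in build_inverted_index(["abc", "bd"]).items():
        print(ch, rules)
```

With this layout, looking up a character of the text to be matched returns every keyword that contains it together with its position in each keyword, in a single dictionary access.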
In a preferred embodiment of the present application, the obtaining module 301 may include a traversal unit and a first obtaining unit.
The traversal unit is configured to traverse the inverted index table and, when the inverted index table includes the character, obtain all matching rules corresponding to the character.
The first obtaining unit is configured to obtain, from each matching rule, the keyword to which the character belongs and the index position of the character in that keyword.
In a preferred embodiment of the present application, the first marking module 303 may include a first recording unit.
The first recording unit is configured to, when the judgment result of the judging module is yes, create matching process information corresponding to the keyword to which the character belongs, save the matching process information in the matching information set, and record the index position of the character in the text to be matched into the matching process information.
In a preferred embodiment of the present application, the second marking module 304 may include a second recording unit.
The second recording unit is configured to, when the judgment result of the judging module is no, search the matching information set and judge whether it contains matching process information corresponding to the keyword to which the character belongs; if it does, obtain that matching process information and record the index position of the character in the text to be matched into it.
In a preferred embodiment of the present application, the second recording unit may further include a judging subunit, a copy-and-update subunit, and an index position recording subunit.
The judging subunit is configured to judge whether the index position of the character in the text to be matched has already been recorded in the matching process information corresponding to the keyword to which the character belongs.
The copy-and-update subunit is configured to, when the judgment result of the judging subunit is yes, copy the matching process information corresponding to the keyword to which the character belongs, and replace the recorded index position of the character in the text to be matched with the current index position of the character in the text to be matched.
The index position recording subunit is configured to, when the judgment result of the judging subunit is no, perform the step of obtaining the matching process information corresponding to the keyword to which the character belongs and recording the index position of the character in the text to be matched into it.
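The branch between recording in place and copying first can be sketched as follows. The text does not say whether the original record is kept after copying; this sketch assumes the original is kept unchanged and only the copy receives the current position, so that both candidate alignments remain available. The name `record_position` and the list-of-lists layout are likewise assumptions.

```python
def record_position(records, slot, pos):
    """Record text position pos for keyword slot `slot`.

    records: in-progress matching process information for one keyword,
             each entry a list of text positions (None = not yet seen).
    Hypothetical helper: the layout is an assumption.
    """
    copies = []
    for rec in records:
        if rec[slot] is None:
            # Judging subunit says "no": record the position in place.
            rec[slot] = pos
        else:
            # Judging subunit says "yes": copy, then update the copy.
            clone = list(rec)
            clone[slot] = pos
            copies.append(clone)
    records.extend(copies)
    return records


if __name__ == "__main__":
    recs = [[0, None, None]]
    record_position(recs, 1, 2)
    record_position(recs, 1, 4)
    print(recs)
```

Keeping one record per candidate alignment is what lets the matcher move forward through the text without revisiting earlier characters.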
In a preferred embodiment of the present application, every character of a keyword being marked as present in the text to be matched means that every position in the matching process information corresponding to the keyword has been set to the index position, in the text to be matched, of the keyword character at that position.
In a preferred embodiment of the present application, the apparatus may further include an output module.
The output module is configured to, after the determining module determines that the text to be matched hits the keyword, obtain matching information from the text to be matched according to the index position of each character in the text to be matched, and output the matching information.
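One possible reading of the output step is sketched below. The text does not define the form of the matching information; this sketch assumes it is the span of the text running from the first matched character to the last. The function name `extract_match` is hypothetical.

```python
def extract_match(text, positions):
    """Return the span of text covered by the matched characters.

    positions: index positions of the keyword's characters in text.
    Assumption: "matching information" means this covering span.
    """
    start, end = min(positions), max(positions)
    return text[start:end + 1]


if __name__ == "__main__":
    print(extract_match("xaybzc", (1, 3, 5)))
```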
In a preferred embodiment of the present application, the apparatus may further include a character distance construction module and a matching check module.
The character distance construction module is configured to construct a character distance rule for each keyword in the keyword set, forming a character distance rule set. The character distance rule includes: the number of characters in the keyword, and the valid distance between each character of the keyword and the character preceding it.
The matching check module is configured to, when the text to be matched is being matched, perform a distance matching check on the keyword to which the current character belongs according to the character distance rule set.
In a preferred embodiment of the present application, the matching check module may include a second obtaining unit, a third obtaining unit, a first checking unit, and a second checking unit.
The second obtaining unit is configured to obtain, for the keyword to which the current character belongs, the index position in the text to be matched of the character preceding the current character, and to calculate a first distance between the current character and its preceding character from the index positions of the two characters in the text to be matched.
The third obtaining unit is configured to obtain, from the character distance rule corresponding to the keyword to which the current character belongs, the valid distance between the current character and its preceding character as a second distance.
The first checking unit is configured to, if the second distance denotes an arbitrary interval length or the second distance is greater than the first distance, treat the distance check as successful and take the next character after the current character for matching.
The second checking unit is configured to, if the second distance does not denote an arbitrary interval length and the second distance is less than the first distance, treat the distance check as failed, invalidate the matching process information, and end the matching of the keyword to which the character belongs.
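The distance check can be sketched in a few lines. The text specifies the outcome when the second distance is greater than or less than the first distance but does not state the case where the two are equal; this sketch treats equality as a pass, which is an assumption. Representing "arbitrary interval length" by the sentinel `ANY` and the name `distance_ok` are also assumptions.

```python
ANY = None  # assumed sentinel meaning "arbitrary interval length"


def distance_ok(prev_pos, cur_pos, valid_distance):
    """Return True if the distance check passes (hypothetical helper).

    prev_pos, cur_pos: index positions in the text to be matched of the
                       preceding character and the current character.
    valid_distance:    second distance taken from the character
                       distance rule, or ANY.
    """
    if valid_distance is ANY:
        return True
    first_distance = cur_pos - prev_pos
    # Equality is treated as a pass here; the source leaves it open.
    return first_distance <= valid_distance


if __name__ == "__main__":
    print(distance_ok(1, 3, ANY), distance_ok(1, 3, 5), distance_ok(1, 9, 5))
```

A failed check means the in-progress record for that keyword is discarded rather than repaired, so no backtracking over the text is needed.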
For the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the method embodiments and is not repeated here.
FIG. 4 is a block diagram of a device 800 for keyword fuzzy matching according to an exemplary embodiment. For example, the device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
Referring to FIG. 4, the device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 typically controls the overall operation of the device 800, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 802 may include one or more processors 820 to execute instructions so as to perform all or part of the steps of the method described above. In addition, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation of the device 800. Examples of such data include instructions for any application or method operating on the device 800, contact data, phone book data, messages, pictures, videos, and the like. The memory 804 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
The power component 806 supplies power to the various components of the device 800. The power component 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with it. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the device 800 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each of the front camera and the rear camera may be a fixed optical lens system or may have focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC) that is configured to receive external audio signals when the device 800 is in an operating mode such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. The buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.
The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the device 800. For example, the sensor component 814 can detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the device 800. The sensor component 814 can also detect a change in position of the device 800 or of one of its components, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and changes in the temperature of the device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the device 800 and other devices. The device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the method described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 804 including instructions, where the instructions are executable by the processor 820 of the device 800 to perform the method described above. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
A non-transitory computer-readable storage medium is provided such that, when the instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is able to perform a keyword fuzzy matching method. The method includes: for each character in the text to be matched, obtaining, according to a keyword set, the keyword to which the character belongs and the index position of the character in that keyword; judging, according to the index position of the character in the keyword to which it belongs, whether the character is the first character of that keyword; if the character is the first character of the keyword to which it belongs, recording that keyword in a matching information set and marking in the record that the first character of the keyword is present in the text to be matched; if the character is not the first character of the keyword to which it belongs and the matching information set contains a record of that keyword, obtaining the record and marking in it that the character of the keyword is present in the text to be matched; and, when every character of a keyword has been marked as present in the text to be matched, determining that the text to be matched hits the keyword.
FIG. 5 is a schematic structural diagram of a server in an embodiment of the present invention. The server 1900 may vary considerably with configuration or performance, and may include one or more central processing units (CPUs) 1922 (for example, one or more processors), a memory 1932, and one or more storage media 1930 (for example, one or more mass storage devices) storing application programs 1942 or data 1944. The memory 1932 and the storage medium 1930 may provide transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
An embodiment of the present invention further provides a program including readable code which, when run on a server, causes the server to perform the keyword fuzzy matching method according to any embodiment of the present invention. A readable medium storing the program according to an embodiment of the present invention is also provided.
For example, FIG. 6 shows a server that can implement the keyword fuzzy matching method according to the present invention. The server conventionally includes a processor 1610 and a program product or readable medium in the form of a memory 1620. The memory 1620 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, or ROM. The memory 1620 has a storage space 1630 for program code 1631 for performing any of the method steps described above. For example, the storage space 1630 for program code may include individual program codes 1631, each implementing one of the steps of the above method. The program code can be read from, or written to, one or more program products. These program products include program code carriers such as memory cards. Such a program product is typically a portable or fixed storage unit as described with reference to FIG. 7. The storage unit may have storage segments, storage space, and the like arranged similarly to the memory 1620 in the server of FIG. 6. The program code may, for example, be compressed in an appropriate form. Typically, the storage unit includes readable code 1631', that is, code that can be read by a processor such as the processor 1610 and that, when run by the server, causes the server to perform the steps of the method described above.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
Other embodiments of the invention will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. The present invention is intended to cover any variations, uses, or adaptations of the invention that follow its general principles and include common general knowledge or customary technical means in the art that are not disclosed in this disclosure. The specification and embodiments are to be regarded as exemplary only, and the true scope and spirit of the invention are indicated by the following claims.
It should be understood that the invention is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (14)

  1. A keyword fuzzy matching method, comprising:
    for each character in a text to be matched, obtaining, according to a keyword set, the keyword to which the character belongs and the index position of the character in that keyword;
    judging, according to the index position of the character in the keyword to which it belongs, whether the character is the first character of the keyword;
    if the character is the first character of the keyword to which it belongs, recording that keyword in a matching information set, and marking in the record that the first character of the keyword is present in the text to be matched;
    if the character is not the first character of the keyword to which it belongs and the matching information set contains a record of that keyword, obtaining the record and marking in the record that the character of the keyword is present in the text to be matched; and
    when every character of a keyword has been marked as present in the text to be matched, determining that the text to be matched hits the keyword.
  2. The method according to claim 1, further comprising:
    for each character of each keyword in the keyword set, constructing a matching rule corresponding to the character, the matching rule containing: the character, the keyword to which the character belongs, the number of characters in that keyword, and the index position of the character in that keyword;
    obtaining the matching rule corresponding to each character of a keyword to form a matching rule set corresponding to the keyword; and
    constructing, according to the matching rule set, an inverted index table from characters to matching rules, the inverted index table including: each character, and all matching rules corresponding to that character.
  3. The method according to claim 2, wherein obtaining, according to the keyword set, the keyword to which the character belongs and the index position of the character in that keyword comprises:
    traversing the inverted index table and, when the inverted index table includes the character, obtaining all matching rules corresponding to the character; and
    for each matching rule, obtaining the keyword to which the character belongs and the index position of the character in that keyword as contained in the matching rule.
  4. The method according to claim 1, wherein, if the character is the first character of the keyword to which it belongs, recording that keyword in the matching information set and marking in the record that the first character of the keyword is present in the text to be matched comprises:
    if the character is the first character of the keyword to which it belongs, creating matching process information corresponding to that keyword, saving the matching process information in the matching information set, and recording the index position of the character in the text to be matched into the matching process information.
  5. The method according to claim 4, wherein, if the character is not the first character of the keyword to which it belongs and the matching information set contains a record of that keyword, obtaining the record and marking in the record that the character of the keyword is present in the text to be matched comprises:
    if the character is not the first character of the keyword to which it belongs, searching the matching information set and judging whether the matching information set contains matching process information corresponding to that keyword; and, if it does, obtaining the matching process information corresponding to that keyword and recording the index position of the character in the text to be matched into the matching process information.
  6. The method according to claim 5, wherein, when the matching information set contains matching process information corresponding to the keyword to which the character belongs, the method further comprises:
    judging whether the index position of the character in the text to be matched has already been recorded in the matching process information corresponding to the keyword to which the character belongs;
    if so, copying the matching process information corresponding to that keyword, and replacing the recorded index position of the character in the text to be matched with the current index position of the character in the text to be matched; and
    if not, performing the step of obtaining the matching process information corresponding to that keyword and recording the index position of the character in the text to be matched into the matching process information.
  7. The method according to claim 4, wherein every character of a keyword being marked as present in the text to be matched comprises:
    every position in the matching process information corresponding to the keyword being set to the index position, in the text to be matched, of the keyword character at that position.
  8. The method according to claim 7, wherein, after determining that the text to be matched hits the keyword, the method further comprises: obtaining matching information from the text to be matched according to the index position of each character in the text to be matched, and outputting the matching information.
  9. The method according to claim 1, further comprising:
    for each keyword in the keyword set, constructing a character distance rule to form a character distance rule set, the character distance rule including: the number of characters in the keyword, and the valid distance between each character of the keyword and the character preceding it; and
    when the text to be matched is being matched, performing a distance matching check on the keyword to which the current character belongs according to the character distance rule set.
  10. The method according to claim 9, wherein performing the matching check on the keyword to which the current character belongs according to the character distance rule set comprises:
    obtaining, from the keyword to which the current character belongs, the index position in the text to be matched of the character preceding the current character, and calculating, from the index position in the text to be matched of the preceding character and the index position in the text to be matched of the current character, a first distance between the current character and the preceding character;
    obtaining, from the character distance rule corresponding to the keyword to which the current character belongs, the valid distance between the current character and its preceding character as a second distance;
    if the second distance denotes an arbitrary interval length, or the second distance is greater than the first distance, the distance check succeeds, and the character following the current character is obtained for matching;
    if the second distance does not denote an arbitrary interval length and the second distance is less than the first distance, the distance check fails, the matching process information becomes invalid, and matching of the keyword to which the character belongs ends.
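The distance check of claims 9 and 10 can be sketched roughly as below. The `DistanceRule` class, the `ANY` sentinel, and the function name are hypothetical. The claim does not say what happens when the two distances are equal; this sketch assumes that case passes.

```python
from dataclasses import dataclass
from typing import List, Optional

ANY = None  # hypothetical sentinel meaning "any interval length is allowed"


@dataclass
class DistanceRule:
    char_count: int                      # number of characters in the keyword
    valid_distance: List[Optional[int]]  # valid_distance[i]: char i vs. char i-1
                                         # (slot 0 is unused and left as ANY)


def distance_check(rule: DistanceRule, kw_pos: int,
                   prev_text_pos: int, cur_text_pos: int) -> bool:
    """Return True if the keyword character at kw_pos passes the check."""
    first = cur_text_pos - prev_text_pos   # first distance, measured in the text
    second = rule.valid_distance[kw_pos]   # second distance, taken from the rule
    if second is ANY:
        return True                        # arbitrary interval: always passes
    # Assumption: equal distances pass; the claim only covers > and <.
    return second >= first
```

On a `False` result the caller would discard the matching process information for that keyword, as the last clause of claim 10 describes.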
  11. An apparatus for fuzzy keyword matching, comprising:
    an obtaining module, configured to, for each character in a text to be matched, obtain according to a keyword set the keyword to which the character belongs and the index position of the character in that keyword;
    a judging module, configured to judge, according to the index position of the character in the keyword to which it belongs, whether the character is the first character of that keyword;
    a first marking module, configured to, when the judgment result of the judging module is yes, record the keyword to which the character belongs in a matching information set, and mark in the record that the first character of the keyword is present in the text to be matched;
    a second marking module, configured to, when the judgment result of the judging module is no and a record of the keyword to which the character belongs exists in the matching information set, obtain the record of the keyword to which the character belongs, and mark in the record that the character in the keyword is present in the text to be matched;
    a determining module, configured to determine that the text to be matched hits a keyword when every character in the keyword has been marked as present in the text to be matched.
  12. An apparatus for fuzzy keyword matching, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs containing instructions for performing the following operations:
    for each character in a text to be matched, obtaining according to a keyword set the keyword to which the character belongs and the index position of the character in that keyword;
    judging, according to the index position of the character in the keyword to which it belongs, whether the character is the first character of that keyword;
    if the character is the first character of the keyword to which it belongs, recording the keyword to which the character belongs in a matching information set, and marking in the record that the first character of the keyword is present in the text to be matched;
    if the character is not the first character of the keyword to which it belongs and a record of the keyword to which the character belongs exists in the matching information set, obtaining the record of the keyword to which the character belongs, and marking in the record that the character in the keyword is present in the text to be matched;
    when every character in a keyword has been marked as present in the text to be matched, determining that the text to be matched hits the keyword.
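The operations listed in claims 11 and 12 can be sketched as a single pass over the text. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names and the dictionary-based character index are hypothetical, keywords are assumed to contain no repeated characters, and the copy step and distance checks of the dependent method claims are left out.

```python
from typing import Dict, List, Optional, Tuple


def build_char_index(keywords: List[str]) -> Dict[str, List[Tuple[str, int]]]:
    """Map each character to the (keyword, index position) pairs it occurs in."""
    index: Dict[str, List[Tuple[str, int]]] = {}
    for kw in keywords:
        for pos, ch in enumerate(kw):
            index.setdefault(ch, []).append((kw, pos))
    return index


def fuzzy_match(text: str, keywords: List[str]) -> List[str]:
    """Return the keywords hit by text, in the order they are completed.

    Assumes no keyword contains a repeated character (simplification).
    """
    char_index = build_char_index(keywords)
    # The "matching information set": keyword -> one slot per keyword character.
    records: Dict[str, List[Optional[int]]] = {}
    hits: List[str] = []
    for text_pos, ch in enumerate(text):
        for kw, kw_pos in char_index.get(ch, []):
            if kw_pos == 0:
                # First character: create the record if needed and mark slot 0.
                record = records.setdefault(kw, [None] * len(kw))
            elif kw in records:
                # Later character: only marked when a record already exists.
                record = records[kw]
            else:
                continue
            record[kw_pos] = text_pos
            # Hit once every character of the keyword has been marked.
            if kw not in hits and all(slot is not None for slot in record):
                hits.append(kw)
    return hits
```

Calling `fuzzy_match("c-h-e-a-p s p a m", ["cheap", "spam"])` yields `["cheap", "spam"]`, while `fuzzy_match("heap", ["cheap"])` yields an empty list because no record is ever created without the first character.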
  13. A program comprising readable code which, when run on a server, causes the server to perform the fuzzy keyword matching method according to any one of claims 1-10.
  14. A readable medium storing the program according to claim 13.
PCT/CN2016/104693 2015-12-11 2016-11-04 Fuzzy keyword matching method and apparatus WO2017097075A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510921094.0A CN105550298B (en) 2015-12-11 2015-12-11 Keyword fuzzy matching method and device
CN201510921094.0 2015-12-11

Publications (1)

Publication Number Publication Date
WO2017097075A1 (en) 2017-06-15

Family

ID=55829487

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/104693 WO2017097075A1 (en) 2015-12-11 2016-11-04 Fuzzy keyword matching method and apparatus

Country Status (2)

Country Link
CN (1) CN105550298B (en)
WO (1) WO2017097075A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550298B (en) * 2015-12-11 2019-12-10 北京搜狗科技发展有限公司 Keyword fuzzy matching method and device
CN106649427B (en) * 2016-08-08 2020-07-03 中国移动通信集团湖北有限公司 Information identification method and device
CN109635009B (en) * 2018-12-27 2023-09-15 北京航天智造科技发展有限公司 Fuzzy matching inquiry system
CN110008383B (en) * 2019-04-11 2021-07-27 北京安护环宇科技有限公司 Black and white list retrieval method and device based on multiple indexes
CN110442570B (en) * 2019-06-06 2021-08-17 北京左江科技股份有限公司 BitMap high-speed fuzzy search method
CN113420192B (en) * 2021-06-09 2022-04-05 湖南大学 UI element searching method based on fuzzy matching

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102932421A (en) * 2012-09-28 2013-02-13 中国联合网络通信集团有限公司 Cloud back-up method and device
CN104602206A (en) * 2014-12-31 2015-05-06 上海大汉三通通信股份有限公司 Spam message identification method and system
CN105205048B (en) * 2015-10-21 2018-05-04 迪爱斯信息技术股份有限公司 A kind of hot word analytic statistics system and method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100169341A1 (en) * 2008-12-30 2010-07-01 Ebay Inc. Predictive algorithm for search box auto-complete
CN102323929A (en) * 2011-08-23 2012-01-18 上海粱江通信技术有限公司 Method for realizing fuzzy matching of Chinese short message with keyword
CN104598464A (en) * 2013-10-31 2015-05-06 联想(北京)有限公司 Information processing method and electronic equipment
CN104750673A (en) * 2013-12-31 2015-07-01 中国移动通信集团公司 Text matching and filtering method and text matching and filtering device
CN103902714A (en) * 2014-04-03 2014-07-02 北京国双科技有限公司 Method and device for filtering keywords
CN105550298A (en) * 2015-12-11 2016-05-04 北京搜狗科技发展有限公司 Keyword fuzzy matching method and device

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109783607A (en) * 2018-12-19 2019-05-21 南京莱斯信息技术股份有限公司 A method of the match cognization magnanimity keyword in any text
CN109977422A (en) * 2019-04-18 2019-07-05 中国石油大学(华东) A kind of case history key message extraction model based on participle technique
CN110134686A (en) * 2019-05-07 2019-08-16 浪潮软件集团有限公司 A kind of index creation method and system of Chinese key fuzzy query
CN115210708A (en) * 2019-08-07 2022-10-18 齐纳特科技公司 Data entry features for information tracking systems
CN115210708B (en) * 2019-08-07 2023-09-01 齐纳特科技公司 Method and system for processing text data, and non-transitory computer readable medium
US11783127B2 (en) 2019-08-07 2023-10-10 Zinatt Technologies, Inc. Data entry feature for information tracking system
CN112052413A (en) * 2020-08-28 2020-12-08 上海谋乐网络科技有限公司 URL fuzzy matching method, device and system
CN112052413B (en) * 2020-08-28 2024-02-13 上海谋乐网络科技有限公司 URL fuzzy matching method, device and system

Also Published As

Publication number Publication date
CN105550298A (en) 2016-05-04
CN105550298B (en) 2019-12-10

Similar Documents

Publication Publication Date Title
WO2017097075A1 (en) Fuzzy keyword matching method and apparatus
US10142351B1 (en) Retrieving contact information based on image recognition searches
CN107102746B (en) Candidate word generation method and device and candidate word generation device
CN107357779B (en) A kind of method and device obtaining organization names
CN109522419B (en) Session information completion method and device
WO2017157040A1 (en) Search method and device, and device used for searching
KR102138184B1 (en) Using metadata to summarize social media content
US20180039647A1 (en) Suggesting Filters for Search on Online Social Networks
WO2017143930A1 (en) Method of sorting search results, and device for same
JP2013510368A (en) Text filtering method and system
RU2673401C2 (en) Method and device for obtaining certification document
JP5989781B2 (en) Method and apparatus for progressive pattern matching in a mobile environment
KR102046582B1 (en) Method and apparatus for providing call log in electronic device
CN108427761B (en) News event processing method, terminal, server and storage medium
CN108027820A (en) For producing phrase blacklist to prevent some contents from appearing in the method and system in search result in response to search inquiry
KR20160059455A (en) Method and device for identifying encoding of web page
WO2017016384A1 (en) Short message processing method, information processing method and device, mobile terminal and storage medium
EP3387556B1 (en) Providing automated hashtag suggestions to categorize communication
CN109471919B (en) Zero pronoun resolution method and device
WO2017107708A1 (en) User proxy self-adaptation uniform resource locator prefix mining method and device
CN109783244B (en) Processing method and device for processing
CN110928425A (en) Information monitoring method and device
CN107229698B (en) A kind of method and device of information processing
CN109189824B (en) Method and device for retrieving similar articles
CN110020082B (en) Searching method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16872267

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16872267

Country of ref document: EP

Kind code of ref document: A1