WO2017097075A1 - Fuzzy keyword matching method and apparatus

Info

Publication number: WO2017097075A1
Application number: PCT/CN2016/104693
Authority: WO
Inventors: 李剑; 毛宏
Original assignee: 北京搜狗科技发展有限公司
Priority date: 2015-12-11
Filing date: 2016-11-04
Publication date: 2017-06-15
Also published as: CN105550298A; CN105550298B

Abstract

A fuzzy keyword matching method and apparatus. The method comprises: acquiring a keyword to which each character in a to-be-matched text pertains, and an index bit of the character in the pertaining keyword according to a keyword set respectively (101); judging the index bit of the character in the pertaining keyword (102); if the character is a first character, recording the pertaining keyword of the character, and marking the first character in the keyword into the to-be-matched text (103); and if the character is not the first character and when recorded keywords exist, searching the recorded keywords for the pertaining keyword of the character, and marking the character in the keyword into the to-be-matched text (104). When each character in a keyword is marked into the to-be-matched text, it is determined that the to-be-matched text hits the keyword. The method and apparatus can overcome the defect in the prior art of low keyword matching efficiency to a certain extent.

Description

Method and device for keyword fuzzy matching

The present application claims priority to Chinese invention patent application No. 201510921094.0, filed on December 11, 2015 and entitled "A method and device for fuzzy matching of keywords".

Technical field

The present invention relates to the field of Internet technologies, and in particular, to a method and apparatus for keyword fuzzy matching.

Background

With the rapid development of Internet technology, network communication has also developed rapidly. Network communication is rapidly infiltrating into the political, economic, ideological and cultural fields of the world with irresistible momentum, changing people's lives and changing the face of society. However, online communication inevitably brings a series of negative effects, such as ideology and cultural penetration, fake news and information dissemination, and violation of personal privacy.

Managing network communication usually involves text content analysis, network public opinion monitoring, intrusion detection and information retrieval. Keyword fuzzy matching technology is widely used in these fields.

In the prior art, fuzzy matching of a keyword usually relies on regular expressions: a single pattern string describes and matches a family of strings that conform to a given rule. A regular expression matching engine generally compiles the regular expression into a non-deterministic finite automaton (NFA) or a deterministic finite automaton (DFA) and then scans and matches the text. Regular expressions are usually verbose, keyword matching with them is inefficient, and the matching process takes a long time, which wastes network resources.

Summary of the invention

The invention provides a method for keyword fuzzy matching which can, to a certain extent, overcome the low efficiency of existing keyword matching.

The invention provides the following solutions:

A method for fuzzy matching of keywords, comprising: for each character in a text to be matched, obtaining, according to a keyword set, the keyword to which the character belongs and the index bit of the character in that keyword; determining, according to the index bit of the character in the keyword, whether the character is the first character of the keyword; if the character is the first character of the keyword, recording the keyword to which the character belongs in a matching information set, and marking in the record that the first character of the keyword is present in the text to be matched; if the character is not the first character of the keyword and a record of the keyword to which the character belongs exists in the matching information set, obtaining that record and marking in it that the character of the keyword is present in the text to be matched; and when every character of a keyword has been marked as present in the text to be matched, determining that the text to be matched hits the keyword.

An apparatus for fuzzy matching of keywords, comprising: an obtaining module, configured to obtain, for each character in a text to be matched and according to a keyword set, the keyword to which the character belongs and the index bit of the character in that keyword; a judging module, configured to determine, according to the index bit of the character in the keyword, whether the character is the first character of the keyword; a first marking module, configured to, when the result of the judging module is yes, record the keyword to which the character belongs in a matching information set and mark in the record that the first character of the keyword is present in the text to be matched; a second marking module, configured to, when the result of the judging module is no and a record of the keyword to which the character belongs exists in the matching information set, obtain that record and mark in it that the character of the keyword is present in the text to be matched; and a determining module, configured to determine that the text to be matched hits a keyword when every character of that keyword has been marked as present in the text to be matched.

An apparatus for keyword fuzzy matching, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include instructions for: obtaining, for each character in a text to be matched and according to a keyword set, the keyword to which the character belongs and the index bit of the character in that keyword; determining, according to the index bit of the character in the keyword, whether the character is the first character of the keyword; if the character is the first character of the keyword, recording the keyword to which the character belongs in a matching information set, and marking in the record that the first character of the keyword is present in the text to be matched; if the character is not the first character of the keyword and a record of the keyword to which the character belongs exists in the matching information set, obtaining that record and marking in it that the character of the keyword is present in the text to be matched; and when every character of a keyword has been marked as present in the text to be matched, determining that the text to be matched hits the keyword.

A program comprising readable code that, when executed on a server, causes the server to perform a method of keyword fuzzy matching according to any of the embodiments of the present invention.

A readable medium in which the program described in the embodiments of the present invention is stored.

According to a specific embodiment provided by the present invention, the present invention discloses the following technical effects:

With the keyword fuzzy matching method of the present invention, for each character in the text to be matched, the keyword to which the character belongs is obtained and the index bit of the character in that keyword is examined. If the character is the first character, the keyword to which it belongs is recorded and its first character is marked as present in the text to be matched; if it is not the first character, the keyword to which it belongs is looked up among the recorded keywords and the character is marked as present in the text to be matched. When every character of a keyword has been marked as present in the text to be matched, the text to be matched hits that keyword. Because matching proceeds character by character, keywords whose first character does not occur in the text to be matched are filtered out and the keywords need not be tried exhaustively, so keywords can be hit quickly, which saves time and improves efficiency.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of a method for keyword fuzzy matching according to an embodiment of the present invention;

FIG. 2 is a flowchart of a method for performing keyword fuzzy matching for each character in a text to be matched according to an embodiment of the present invention;

FIG. 3 is a block diagram of a multi-keyword fuzzy matching device according to an embodiment of the present invention;

FIG. 4 is a block diagram of an apparatus for keyword fuzzy matching, according to an exemplary embodiment;

FIG. 5 is a schematic structural diagram of a server in an embodiment of the present invention;

FIG. 6 shows a block diagram of a server for performing a multi-keyword fuzzy matching method according to the present invention;

FIG. 7 shows a storage unit for holding or carrying program code implementing the multi-keyword fuzzy matching method according to the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are clearly and completely described in the following with reference to the accompanying drawings in the embodiments of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, but not all embodiments. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention are within the scope of the present invention.

The invention provides a method for keyword fuzzy matching, as shown in FIG. 1 , comprising:

Step S101: for each character in the text to be matched, acquiring, according to the keyword set, the keyword to which the character belongs and the index bit of the character in that keyword;

Specifically, the text to be matched is scanned, and each time a character is scanned, the keyword to which the character belongs is obtained from the keyword set, and an index bit of the character in the keyword is obtained. One character in the text to be matched may correspond to one or more keywords, or may not have corresponding keywords.

In this embodiment, the method further includes: constructing, for each character of each keyword in the keyword set, a matching rule corresponding to that character, where the matching rule includes the character, the keyword to which the character belongs, the number of characters included in that keyword, and the index bit of the character in that keyword; collecting the matching rules corresponding to the characters of each keyword to form a matching rule set; and constructing, according to the matching rule set, an inverted index table from characters to matching rules. The inverted index table includes each character together with all matching rules corresponding to it; here, a matching rule corresponding to a character means a matching rule that includes the character.

When any character of the text to be matched is scanned, the keyword to which the character belongs and the index bit of the character in that keyword are obtained according to the inverted index table. Specifically, each character in the text to be matched is looked up in the inverted index table; when the inverted index table includes the character, all matching rules corresponding to the character are obtained, and for each matching rule, the keyword it records and the index bit of the character in that keyword are obtained.

In the embodiment of the present application, each keyword is split into its characters and a matching rule is established for each character, from which the inverted index table is created. Looking a character up in the inverted index table quickly yields all keywords corresponding to the character and the index bit of the character in each of them, so the text no longer needs to be matched against each keyword in turn, which makes the matching process simpler, faster, and more efficient.
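
The construction and lookup described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names MatchingRule, build_inverted_index and lookup are hypothetical, the wildcard symbols are assumed to be "?" and "*", and Latin letters stand in for the characters of the keywords.

```python
from collections import defaultdict
from typing import NamedTuple

WILDCARDS = {"?", "*"}  # assumed wildcard symbols


class MatchingRule(NamedTuple):
    character: str  # the non-wildcard character
    keyword: str    # the keyword the character belongs to
    size: int       # number of non-wildcard characters in that keyword
    index: int      # index bit of the character within the keyword


def build_inverted_index(keywords):
    """Map each character to every matching rule that contains it."""
    table = defaultdict(list)
    for keyword in keywords:
        chars = [c for c in keyword if c not in WILDCARDS]
        for index, character in enumerate(chars):
            table[character].append(
                MatchingRule(character, keyword, len(chars), index))
    return dict(table)


def lookup(table, character):
    """Return (keyword, index bit) pairs for one scanned character."""
    return [(rule.keyword, rule.index) for rule in table.get(character, [])]


# Hypothetical keywords: Latin letters stand in for the characters.
table = build_inverted_index(["a??b*cd", "e??af"])
print(lookup(table, "a"))  # 'a' occurs in both keywords
print(lookup(table, "z"))  # no matching rule, so the character is ignored
```

A single dictionary lookup per scanned character replaces a pass over every keyword, which is the efficiency argument made in the text.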

The keywords may include wildcards and non-wildcards. In this embodiment, non-wildcards are collectively referred to as characters. A keyword may contain one or several wildcards, which may be consecutive or separated, and a wildcard may stand for any character length. In this embodiment, keywords may be added to, modified in, and deleted from the keyword set; correspondingly, the content of the inverted index table is updated according to the specific operation.
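
One way to keep an inverted index table in step with changes to the keyword set is sketched below. The text does not say how the update is carried out, so the functions add_keyword, remove_keyword and modify_keyword, the tuple layout (keyword, size, index bit) and the wildcard symbols "?" and "*" are all assumptions made for illustration.

```python
def add_keyword(table, keyword, wildcards="?*"):
    """Insert one matching rule per non-wildcard character of the keyword."""
    chars = [c for c in keyword if c not in wildcards]
    for index, ch in enumerate(chars):
        table.setdefault(ch, []).append((keyword, len(chars), index))


def remove_keyword(table, keyword):
    """Drop every matching rule that refers to the keyword."""
    for ch in list(table):
        table[ch] = [rule for rule in table[ch] if rule[0] != keyword]
        if not table[ch]:
            del table[ch]  # no keyword uses this character any more


def modify_keyword(table, old_keyword, new_keyword):
    """Treat a modification as a deletion followed by an addition."""
    remove_keyword(table, old_keyword)
    add_keyword(table, new_keyword)


table = {}
add_keyword(table, "a??b*cd")   # hypothetical keywords
add_keyword(table, "e??af")
remove_keyword(table, "e??af")
print(table["a"])
```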

Step S102: determining, according to the index bit of the character in the keyword, whether the character is the first character of the keyword; if it is, step S103 is performed; if it is not, step S104 is performed;

Specifically, the characters in the keyword are divided into a first character and a non-first character, wherein the first character refers to the first non-wildcard in the keyword. Non-first characters refer to non-wildcards in the keyword other than the first character. When there is only one non-wildcard in the keyword, the character is the first character.

Step S103: recording the keyword to which the character belongs in the matching information set, and marking in the record that the first character of the keyword is present in the text to be matched;

Specifically, in step S103, if the character is the first character of the keyword, matching process information corresponding to the keyword to which the character belongs is newly created and saved in the matching information set, and the index bit of the character in the text to be matched is recorded in the matching process information. The matching process information corresponds one-to-one with the keyword to which the character belongs.

Further, recording the index bit of the character in the text to be matched into the matching process information marks that the character of the keyword is present in the text to be matched; the recorded index bit is also used to output matching information after the keyword is subsequently hit.

Step S104: when a record of the keyword to which the character belongs exists in the matching information set, obtaining that record and marking in it that the character of the keyword is present in the text to be matched. Specifically, in step S104, if the character is not the first character of the keyword, the matching information set is searched to determine whether it contains a record of the keyword to which the character belongs. If it does, the record is obtained and the character of the keyword is marked in it as present in the text to be matched; if it does not, the character is ignored and scanning continues with the next character in the text to be matched.

In the embodiment of the present application, determining in step S104 whether a record of the keyword exists in the matching information set may consist of determining whether the matching information set contains matching process information corresponding to the keyword to which the character belongs. If it does, that matching process information is obtained and the index bit of the character in the text to be matched is recorded in it; if not, the character is ignored.

Specifically, when it is determined that the character is a non-first character, it is determined whether the keyword has been recorded by determining whether there is matching process information corresponding to the keyword to which the character belongs in the matching information set.

If matching process information corresponding to the keyword to which the character belongs exists, the keyword has already been recorded. The record of the keyword, namely that matching process information, is obtained, and the index bit of the character in the text to be matched is recorded in it.

If no matching process information corresponding to the keyword to which the character belongs exists, the keyword has not been recorded; since the character is also not the first character of the keyword, the character can be ignored and the next character is obtained from the text to be matched for matching.
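
Steps S101 to S104 can be pictured with the simplified sketch below. It is not the full method: it deliberately leaves out the distance verification and the copying of matching process information described later, and it simply starts a fresh record whenever a first character is seen again. The names build_index and scan, the value -1 for an unmarked position, and the Latin-letter keywords are assumptions for illustration.

```python
def build_index(keywords, wildcards="?*"):
    """Character -> list of (keyword, size, index bit) matching rules."""
    table = {}
    for keyword in keywords:
        chars = [c for c in keyword if c not in wildcards]
        for index, ch in enumerate(chars):
            table.setdefault(ch, []).append((keyword, len(chars), index))
    return table


def scan(text, table):
    """Return the keywords whose characters were all marked (S101 to S104)."""
    records = {}  # matching information set: keyword -> text positions
    hits = set()
    for pos, ch in enumerate(text):
        for keyword, size, index in table.get(ch, []):  # S101
            if index == 0:                                # S102, then S103
                records[keyword] = [-1] * size            # new record
            elif keyword not in records:                  # S104: no record
                continue                                  # ignore character
            records[keyword][index] = pos                 # mark as present
            if -1 not in records[keyword]:
                hits.add(keyword)
    return hits


table = build_index(["a??b*cd", "e??af"])  # hypothetical keywords
print(scan("a12b999cd", table))
print(scan("bcd", table))  # first character never seen, so nothing is recorded
```

The second call shows the filtering effect: because "a" never appears, the later characters of the keyword are ignored without any record being created.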

It should be noted that the matching process information corresponds one-to-one with the keyword to which the character belongs, and the number of bits in each piece of matching process information equals the number of characters in the corresponding keyword. Each bit marks whether the character at the corresponding position in the keyword appears in the text to be matched; if it does, the bit is set to the index bit of that character in the text to be matched.

It can be seen that when every bit in a piece of matching process information has been set to the index bit, in the text to be matched, of the character at the corresponding position, every character of the corresponding keyword appears in the text to be matched, which indicates that the keyword is hit by the text to be matched. The specific matching process is described in detail in the subsequent embodiments.

Further, when it is determined that the keyword is hit by the text to be matched, the method may further include outputting matching information. Specifically, when the index bits in the text to be matched of all characters of the corresponding keyword have been recorded in the matching process information, it is determined that the text to be matched hits the keyword; matching information can then be obtained from the text to be matched according to the index bit of each character in the text to be matched, and output.

In the embodiment of the present application, the matching process information makes it possible to determine whether each character in the keyword exists in the text to be matched and, if so, its index bit in the text to be matched.

In this embodiment, when each character in a keyword is marked as present in the text to be matched, it is determined that the text to be matched hits the keyword. Specifically, when the index bits in the text to be matched of all characters of a keyword have been recorded in the matching process information corresponding to that keyword, it is determined that the text to be matched hits the keyword.

Optionally, in this embodiment, in step S104, after the matching process information corresponding to the keyword to which the character belongs is obtained and before the index bit of the character in the text to be matched is recorded in it, the method further includes: determining whether an index bit in the text to be matched has already been recorded for the character in the matching process information. If one has been recorded, the matching process information is copied and the current index bit of the character in the text to be matched is recorded in the copy; if none has been recorded, the step of recording the index bit of the character in the text to be matched into the matching process information is performed.
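
The optional copy step might look like the sketch below. The function name record_position, the use of a plain list per piece of matching process information, the value -1 for an unset element, and the choice to append the copy to the same list are assumptions; the text only states that the information is copied and the new index bit recorded in the copy.

```python
def record_position(infos, info, slot, pos, initial=-1):
    """Record pos for keyword character `slot`.

    If that character already has a recorded index bit, the matching
    process information is copied and the copy receives the new index
    bit, so that both candidate matches stay alive.
    """
    if info[slot] != initial:
        branch = list(info)   # copy the matching process information
        branch[slot] = pos    # record the current index bit in the copy
        infos.append(branch)
        return branch
    info[slot] = pos          # nothing recorded yet: record in place
    return info


infos = [[1, -1, -1]]                   # first character seen at index bit 1
record_position(infos, infos[0], 1, 4)  # second character seen at 4
record_position(infos, infos[0], 1, 6)  # seen again at 6: branch into a copy
print(infos)
```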

In this embodiment, the method further includes: constructing, for each keyword in the keyword set, a character distance rule that includes the number of characters in the keyword and the effective distance between each character and its previous character, so as to form a character distance rule set; and, when matching the text to be matched, verifying the match of the keyword to which the current character belongs according to the character distance rule set. The verification includes: obtaining, from the matching process information of the keyword to which the current character belongs, the index bit in the text to be matched of the character preceding the current character in the keyword; calculating, from that index bit and the index bit of the current character in the text to be matched, a first distance between the current character and its previous character; and obtaining, from the character distance rule corresponding to the keyword, a second distance between the current character and its previous character. If the second distance represents an arbitrary interval length, or represents an interval length greater than that of the first distance, the verification succeeds, and the next character after the current character is obtained and matched against the keyword to which it belongs. If the second distance does not represent an arbitrary interval length and represents an interval length smaller than that of the first distance, the verification fails, the matching process information is invalidated, and matching of the keyword to which the character belongs ends.
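
A rough sketch of this verification follows. Several points are assumptions rather than statements of the text: the first distance is taken to be the number of text characters lying between the two positions, the case where both distances are equal (which the text leaves open) is treated as a success, and a previous character that has not been recorded, or that lies after the current one, is treated as a failure. The rule layout [count, d1, d2, ...] with -1 for an arbitrary interval follows the worked example given later in the description.

```python
ANY_LENGTH = -1  # second distance meaning "arbitrary interval length"


def verify_distance(info, slot, pos, distance_rule):
    """Check keyword character `slot` (>= 1) found at text position `pos`.

    info holds the text positions recorded so far (-1 when unset) and
    distance_rule is [count, d1, d2, ...], where d_k is the allowed
    distance between character k and character k - 1.
    """
    prev_pos = info[slot - 1]
    if prev_pos < 0 or pos <= prev_pos:   # assumption: previous must precede
        return False
    first_distance = pos - prev_pos - 1   # assumption: characters in between
    second_distance = distance_rule[slot]
    if second_distance == ANY_LENGTH:
        return True
    return second_distance >= first_distance  # equality assumed to pass


rule = [4, 2, -1, 0]
print(verify_distance([0, -1, -1, -1], 1, 3, rule))  # gap of 2, allowed 2
print(verify_distance([0, -1, -1, -1], 1, 5, rule))  # gap of 4, allowed 2
```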

On the other hand, in combination with the above technical solutions, this embodiment provides a specific example for description, as follows:

For each character of each keyword in the keyword set, a matching rule may be constructed that includes the character, the keyword to which the character belongs, the number of characters included in that keyword, and the index bit of the character in that keyword, so as to form a matching rule set; an inverted index table from characters to matching rules is then constructed according to the matching rule set.

Specifically, for each character of each keyword in the keyword set, matching rules are constructed that include the character, each keyword to which the character belongs, the number of characters in that keyword, and the index bit of the character in that keyword. Each character of each keyword in the keyword set has at least one matching rule, and the matching rules corresponding to all characters of all keywords in the keyword set constitute the matching rule set. For every distinct character, an entry from the character to its matching rules is constructed in the inverted index table.

For each keyword in the keyword set, a character distance rule including the number of characters in the keyword and the effective distance between each character and the previous character is constructed to form a character distance rule set. Each keyword corresponds to a character distance rule. The number of characters refers to the number of non-wildcards included in the keyword. The effective distance between each character and its previous character refers to the distance between each non-wildcard and its previous non-wildcard.

A character may be contiguous with its previous character, in which case the character distance can be set to 0.

A character and its previous character may also be non-contiguous, in two ways. In the first, the interval has arbitrary length: a wildcard "*" representing an arbitrary length lies between the character and the previous character, and the character distance can be set to -1. In the second, the interval has fixed length: n wildcards "?" lie between the character and the previous character, and the character distance can be set to n (n is a natural number); for example, when the wildcards "??" lie between the character and the previous character, the corresponding character distance is 2.
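
The construction of a character distance rule from a keyword can be sketched as below. The function name build_distance_rule is hypothetical, the rule is laid out as the number of characters followed by one distance per later character, and two choices are assumptions not covered by the text: wildcards before the first character are ignored, and a gap that mixes "?" and "*" is treated as an arbitrary interval.

```python
def build_distance_rule(keyword):
    """Return [count, d1, d2, ...] for a keyword with '?' and '*' wildcards."""
    distances = []
    count = 0           # non-wildcard characters seen so far
    gap = 0             # number of '?' since the previous character
    any_length = False  # whether a '*' occurred since the previous character
    for symbol in keyword:
        if symbol == "?":
            gap += 1
        elif symbol == "*":
            any_length = True
        else:
            if count > 0:
                distances.append(-1 if any_length else gap)
            count += 1
            gap = 0
            any_length = False
    return [count] + distances


# Hypothetical keywords: Latin letters stand in for the characters.
print(build_distance_rule("a??b*cd"))
print(build_distance_rule("e??af"))
```

A keyword shaped like the first one, with four characters separated by "??", "*" and nothing, gives the rule [4, 2, -1, 0].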

In this embodiment, the keyword can be verified by using the character distance rule. The character distance rule set can also be generated when the keyword is matched and checked.

For example, the keyword set contains two keywords: keyword one is "generation ?? open * invoice" and keyword two is "find ?? proxy".

First, a matching rule is constructed for each character in each keyword; the matching rule includes the character, the keyword to which the character belongs, the number of characters included in that keyword, and the index bit of the character in that keyword. The matching rules of each character in keyword one and keyword two are constructed and form the matching rule set, as shown in Table 1. The mapping relationship between each character and its matching rules is then constructed and numbered to obtain the inverted index table, as shown in Table 2.

Table 1

The following is an example of a matching rule set as shown in Table 1. The keyword "generation ?? open * invoice" includes 4 valid characters, where "invoice" consists of the two characters "send" and "ticket", so 4 matching rules are constructed.

The first character (character) is "generation": the corresponding keyword (keyword) is "generation ?? open * invoice", the number of characters in the keyword (size) is 4, and the index bit of "generation" in the keyword (index) is 0.

The second character (character) is "open": the corresponding keyword (keyword) is "generation ?? open * invoice", the number of characters in the keyword (size) is 4, and the index bit of "open" in the keyword (index) is 1.

The third character (character) is "send": the corresponding keyword (keyword) is "generation ?? open * invoice", the number of characters in the keyword (size) is 4, and the index bit of "send" in the keyword (index) is 2.

The fourth character (character) is "ticket": the corresponding keyword (keyword) is "generation ?? open * invoice", the number of characters in the keyword (size) is 4, and the index bit of "ticket" in the keyword (index) is 3.

Table 2

The following is an example of the inverted index table in conjunction with Table 2.

For example, the character "generation" exists in both the keyword "generation ?? open * invoice" and the keyword "find ?? proxy" (as one of the two characters of "proxy"), so it corresponds to two matching rules. The mapping relationships between the character "generation" and these two matching rules are constructed, numbered, and stored in the inverted index table, as shown in Table 2.

For another example, the character "open" exists only in the keyword "generation ?? open * invoice", so it corresponds to one matching rule; the mapping relationship between the character "open" and that matching rule is constructed, numbered, and saved in the inverted index table, as shown in Table 2.

Further, the character distance rule described in the foregoing embodiment of the present application is exemplified.

For keyword one: "generation ?? open * invoice"

The number of characters is four. There are two wildcards "??" between the second character "open" and the first character "generation", so the character distance is 2. There is a wildcard "*" of arbitrary length between the third character "send" and the second character "open", so the character distance is -1. There is no wildcard between the fourth character "ticket" and the third character "send", so the character distance is 0. Therefore, the character distance rule constructed for keyword one is [4, 2, -1, 0].

For keyword two: "find ?? proxy"

Following the same method used for keyword one, the character distance rule constructed for keyword two is [3, 2, 0].

At this point, the character distance rule set contains two character distance rules: the rule [4, 2, -1, 0] of the keyword "generation ?? open * invoice" and the rule [3, 2, 0] of the keyword "find ?? proxy".

When keyword fuzzy matching needs to be performed on the text to be matched, it is performed for each character in the text to be matched, as shown in FIG. 2, as follows:

Step 201: Obtain a character from the text to be matched as the current character;

For example, the text to be matched is "looking for a cheap agent to open a business invoice". The text to be matched is scanned, and the character "find" is taken as the current character.

Step 202: Obtain all matching rules corresponding to the current character from the inverted index table, and determine, for each matching rule, whether the matching process information set of the keyword to which the matching rule belongs is empty; if the matching process information set is empty, step 203 is performed; if it is not empty, step 207 is performed.

In this embodiment, if no matching rule corresponding to the current character is obtained from the inverted index table, the current character is ignored, the next character after the current character is obtained from the text to be matched and taken as the current character, and step 202 is performed again.

Step 203: Determine whether the current character is the first character of the keyword, if yes, execute step 204; otherwise, execute step 212, that is, obtain the next character of the current character from the text to be matched, and use it as the current character, and then perform step 202.

Specifically, the judgment is made according to the index bit, recorded in the matching rule, of the current character in the keyword.

Step 204: Add a matching process information to the matching process information set, and record the index bit of the current character in the to-be-matched text into the matching process information, and perform step 205.

In this embodiment, the matching process information is used to record the index bits, in the text to be matched, of the characters of a keyword. When matching process information is newly created, an initial value may be set for each of its elements; each element represents whether the corresponding character of the keyword appears in the text to be matched and, if so, its index bit in the text to be matched.

Specifically, the number of elements in the matching process information equals the number of characters in the corresponding keyword, and each element indicates whether the corresponding character of the keyword is included in the text to be matched. If not, the element keeps its initial value; if so, the element is the index bit of the character in the text to be matched.

For example, when the current character of the text to be matched is "find", it is determined that "find" is the first character of the keyword "find ?? proxy". A new piece of matching process information is then created; it corresponds to one keyword, contains as many elements as the keyword has characters, and has every element set to an initial value. For example, the initial value of each element is set to -1, indicating that none of the characters of the keyword has appeared in the text to be matched yet. The keyword "find ?? proxy" has three characters, so matching process information with three elements is created and each element is set to -1; that is, the newly created matching process information is index[-1, -1, -1]. The index bit of the scanned character "find" in the text to be matched, 1, is obtained, and the element corresponding to "find" in the matching process information is updated to 1; the updated matching process information is index[1, -1, -1].
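
The creation and update just described can be pictured with a plain list, as in the sketch below. The names new_match_info and mark are hypothetical, and representing the matching process information as a Python list is an assumption made for illustration.

```python
INITIAL = -1  # initial value meaning "character not yet seen in the text"


def new_match_info(size):
    """One element per keyword character, all set to the initial value."""
    return [INITIAL] * size


def mark(info, keyword_index, text_index):
    """Record where a keyword character was found in the text."""
    info[keyword_index] = text_index
    return info


info = new_match_info(3)  # a keyword with three characters
mark(info, 0, 1)          # first character found at index bit 1 of the text
print(info)
```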

Step 205: Determine whether the matching is completed according to the matching process information. If the matching is completed, go to step 206. If the matching is not completed, go to step 212: obtain the next character from the text to be matched, use it as the current character, and perform step 202 again.

In step 205, determining whether the matching is completed according to the matching process information may include: determining whether any element of the matching process information still holds the initial value; if so, the matching is not completed; otherwise, the matching is completed.

It should be noted that, in the foregoing embodiment, when the text to be matched includes a character of the keyword, the element corresponding to that character in the matching process information of the keyword is set to the index bit of the character in the text to be matched; otherwise, the element retains the initial value. Therefore, when no element of the matching process information holds the initial value, it can be determined that all characters of the keyword are included in the text to be matched, that is, the matching is completed.

Step 206: Output matching information according to the matching process information.

Specifically, the index bit of the keyword's first character and the index bit of its last character in the text to be matched are read from the matching process information. The characters of the text to be matched between these two index bits, including the characters at the two index bits themselves, are taken as the matching information and output.
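Steps 205 and 206 can be illustrated with the short sketch below. The helper names are hypothetical, and the text uses the placeholder 'x' for every character that belongs to no keyword, so that only the index bits matter.

```python
INITIAL = -1  # assumed initial value, as in the example above

def is_complete(info):
    """Step 205: matching is complete when no element still holds the initial value."""
    return INITIAL not in info

def matching_information(text, info):
    """Step 206: return the text from the first recorded index bit to the last, inclusive."""
    return text[info[0]:info[-1] + 1]

# Placeholder text: 'x' stands for characters that belong to no keyword.
text = "x找xx代理开xx发票"
print(is_complete([1, 4, -1]))                 # False
print(matching_information(text, [1, 4, 5]))   # 找xx代理
```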

Step 207: Determine whether the current character is the first character of the keyword, if yes, go to step 208; otherwise, go to step 209.

Step 208: Create new matching process information in the matching process information set, record the index bit of the current character in the text to be matched into it, and perform step 205.

Step 209: Obtain, from the matching process information set, the matching process information corresponding to all the keywords to which the current character belongs, record the index bit of the current character in the text to be matched in each matching process information, and perform step 210.

In this embodiment, after the matching process information corresponding to all the keywords of the current character is obtained from the matching process information set, and before the index bit of the current character in the text to be matched is recorded in each matching process information, the method further includes: determining whether an index bit for the current character has already been recorded in the corresponding matching process information. If so, the current matching process information is copied, and the index bit of the current character in the text to be matched is written into the copy; otherwise, the index bit of the current character in the text to be matched is recorded directly into the matching process information. Step 210 is then performed.

Step 210: Perform a distance check on each matching process information according to the character distance rule. If the verification result is successful, step 205 is performed; if the verification result is a failure, step 211 is performed.

Step 211: Mark the keyword as invalid, that is, end the matching of the keyword to which the current character belongs.

Step 212: Obtain the next character of the current character from the text to be matched, and use it as the current character; then perform step 202.

In this embodiment, character matching is evaluated by verifying whether each character of the keyword has already appeared in the text to be matched, and whether the index bits of the keyword's characters in the text to be matched satisfy the character distance rule corresponding to the keyword. There are three cases. In the first case, some characters of the keyword to which the current character belongs have not yet been matched, and matching must continue. In the second case, all other characters of the keyword have been matched, and the distance permitted by the character distance rule between the current character and its previous character is greater than or equal to their actual distance in the text to be matched; the keyword is then valid and is hit. In the third case, the other characters of the keyword have been matched, but the permitted distance between the current character and its previous character is smaller than their actual distance in the text to be matched; the keyword then fails.

The method can be implemented as follows. The character distance rule corresponding to the keyword to which the current character belongs is obtained. The index bit, in the text to be matched, of the character preceding the current character is obtained from the matching process information corresponding to that keyword, and the first distance between the current character and its previous character is calculated from that index bit and the index bit of the current character in the text to be matched. The second distance, that is, the distance permitted between the current character and its previous character, is obtained from the character distance rule. When the second distance denotes an interval of any length, or the second distance is greater than or equal to the first distance, the distance check succeeds, and it is then determined whether any element of the matching process information still holds the initial value. If so, the next character is obtained from the text to be matched and matched as the current character; otherwise, the matching is completed. If the second distance does not denote an interval of any length and is smaller than the first distance, the matching of the keyword to which the current character belongs is invalid and is ended; the next character may then be obtained from the text to be matched and matching continued.
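A minimal sketch of this distance check is given below. It assumes the rule layout used in the worked example of this embodiment (the first entry is the number of characters in the keyword, each later entry is the permitted distance to the previous character, and -1 denotes an interval of any length) and computes the first distance as the difference of the two index bits minus one. The names `first_distance` and `distance_ok` are hypothetical.

```python
ANY_INTERVAL = -1  # rule value meaning "an interval of any length"

def first_distance(prev_index, cur_index):
    """Number of characters lying between two index bits in the text."""
    return cur_index - prev_index - 1

def distance_ok(info, pos, rule):
    """Check the keyword character at position pos against its previous character.

    info: matching process information (index bits in the text)
    rule: [character count, permitted distance for position 1, position 2, ...]
    """
    if pos == 0:
        return True                      # the first character has no distance rule
    second = rule[pos]                   # permitted (second) distance
    if second == ANY_INTERVAL:
        return True
    return second >= first_distance(info[pos - 1], info[pos])

print(distance_ok([1, 4, -1], 1, [3, 2, 0]))         # True: 4 - 1 - 1 = 2 <= 2
print(distance_ok([4, 6, 9, -1], 2, [4, 2, -1, 0]))  # True: any interval allowed
print(distance_ok([1, 5, -1], 1, [3, 2, 0]))         # False: 5 - 1 - 1 = 3 > 2
```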

For the above embodiment, fuzzy matching is illustrated with the text to be matched "Looking for a cheap agent to open a business invoice". Index bits refer to character positions in the original-language text, starting from 0. The text to be matched is scanned as follows:

The character at index bit 0 of the text to be matched ("seek") is retrieved. No corresponding matching rule is found in the inverted index table, so the character is ignored.

The character '找' ("find") is retrieved; its index bit in the text to be matched is 1.

The corresponding matching rule 6 is found in the inverted index table, and the current character '找' is the first character of the keyword "找??代理" ("find ?? agent") corresponding to matching rule 6. New matching process information is created with process information number 1, and the initial value of the element corresponding to '找' is replaced with the index bit of '找' in the text to be matched, 1, as shown in Table 3:

Keyword	Matching process information	Process information number
找??代理	index[]: [1, -1, -1]	1

Table 3

The matching process information with process information number 1 is verified. In index[]: [1, -1, -1], only the first element is not -1 and the other elements hold the initial value -1, so the matching is not yet completed. The current character '找' is the first character of its keyword "找??代理", so no character distance rule applies to it and no distance check is needed.

The characters at index bits 2 and 3 of the text to be matched (together "cheap") are retrieved in turn. No corresponding matching rule is found in the inverted index table for either character, so both are ignored.

The character '代' (the first character of "代理", "agent") is retrieved; its index bit in the text to be matched is 4, and the corresponding matching rules 1 and 2 are found in the inverted index table.

Matching rule 1 is processed. The current character '代' is the first character of the keyword "代??开*发票" (roughly, "issue invoices on behalf of others") corresponding to matching rule 1, so new matching process information is created with process information number 2, and the initial value of the element corresponding to '代' in matching process information 2 is replaced with the index bit of '代' in the text to be matched, 4, as shown in Table 4:

Keyword	Matching process information	Process information number
找??代理	index[]: [1, -1, -1]	1
代??开*发票	index[]: [4, -1, -1, -1]	2

Table 4

The matching process information with process information number 2 is checked. In index[]: [4, -1, -1, -1], only the first element is not the initial value -1 and the remaining three elements all hold -1, so the matching is not yet completed.

Matching rule 2 is processed. The current character '代' is not the first character of the keyword corresponding to matching rule 2, so no new matching process information is created. The corresponding keyword "找??代理" already has matching process information 1, so the initial value of the element corresponding to '代' in matching process information 1 is replaced with the index bit of '代' in the text to be matched, 4. The updated matching process information table is shown in Table 5:

Keyword	Matching process information	Process information number
找??代理	index[]: [1, 4, -1]	1
代??开*发票	index[]: [4, -1, -1, -1]	2

Table 5

Matching process information 1 is verified. The first distance is calculated from the first and second elements of index[]: [1, 4, -1] as 4 - 1 - 1 = 2. For the keyword "找??代理" corresponding to matching process information 1, the character distance rule set is [3, 2, 0]. In this keyword, the character distance rule between '找' and '代' is 2, that is, the second distance equals 2. The first distance does not exceed the second distance, so the character distance rule for '代' in the keyword "找??代理" is satisfied. However, the last element is still -1, so the matching is not yet completed.

The character '理' (the second character of "代理", "agent") is retrieved; its index bit in the text to be matched is 5.

The corresponding matching rule 7 is found in the inverted index table. The current character '理' is not the first character of the keyword corresponding to matching rule 7, so no new matching process information is created. The corresponding keyword "找??代理" already has matching process information 1, so the index bit of '理' in the text to be matched, 5, is recorded in matching process information 1. The updated matching process information table is shown in Table 6:

Keyword	Matching process information	Process information number
找??代理	index[]: [1, 4, 5]	1
代??开*发票	index[]: [4, -1, -1, -1]	2

Table 6

Matching process information 1 is checked. The first distance between the first and second elements of index[] is 4 - 1 - 1 = 2. For the keyword "找??代理" corresponding to matching process information 1, the character distance rule set is [3, 2, 0], in which the rule between '找' and '代' is 2, that is, the second distance equals 2; the distance rule for '代' is therefore satisfied. In the same way, the first distance between the second and third elements is 5 - 4 - 1 = 0, which satisfies the distance rule of 0 for '理', so the distance check passes. Since no element of the matching process information is -1, the matching is successful, and the string between the first and last recorded index bits (index bits 1 to 5), "find cheap agent", is hit.

The character '开' ("open") is retrieved; its index bit in the text to be matched is 6.

The corresponding matching rule 3 is found in the inverted index table. The current character '开' is not the first character of the keyword corresponding to matching rule 3, so no new matching process information is created. The corresponding keyword "代??开*发票" already has matching process information 2, so the index bit of '开' in the text to be matched, 6, is recorded in matching process information 2. The updated matching process information table is shown in Table 7:

Keyword	Matching process information	Process information number
找??代理	index[]: [1, 4, 5]	1
代??开*发票	index[]: [4, 6, -1, -1]	2

Table 7

Matching process information 2 is checked. The first distance between the first and second elements of index[] is 6 - 4 - 1 = 1. For the keyword "代??开*发票" corresponding to matching process information 2, the character distance rule set is [4, 2, -1, 0]. In this keyword, the character distance rule between '代' and '开' is 2, that is, the second distance equals 2. The first distance does not exceed the second distance, so the distance rule for '开' is satisfied. However, the third and fourth elements are still -1, so the matching is not yet completed.

The characters at index bits 7 and 8 of the text to be matched (together "business") are retrieved in turn. No corresponding matching rule is found in the inverted index table for either character, so both are ignored.

The character '发' (the first character of "发票", "invoice") is retrieved; its index bit in the text to be matched is 9, and the corresponding matching rule 4 is found in the inverted index table. The current character '发' is not the first character of the keyword corresponding to matching rule 4, so no new matching process information is created. The corresponding keyword "代??开*发票" already has matching process information 2, so the index bit of '发' in the text to be matched, 9, is recorded in matching process information 2. The updated matching process information table is shown in Table 8:

Keyword	Matching process information	Process information number
找??代理	index[]: [1, 4, 5]	1
代??开*发票	index[]: [4, 6, 9, -1]	2

Table 8

Matching process information 2 is checked. The first distance between the first and second elements of index[] is 6 - 4 - 1 = 1. For the keyword "代??开*发票" corresponding to matching process information 2, the character distance rule set is [4, 2, -1, 0]. The rule between '代' and '开' is 2, that is, the second distance equals 2, so the distance rule for '开' is satisfied. The first distance between the second and third elements is 9 - 6 - 1 = 2. The rule between '开' and '发' is -1, meaning that the second distance is an interval of any length, so the distance check for '发' also passes. The last element is still -1, so the matching is not yet completed.

The character '票' (the second character of "发票", "invoice") is retrieved; its index bit in the text to be matched is 10, and the corresponding matching rule 5 is found in the inverted index table. The current character '票' is not the first character of the keyword corresponding to matching rule 5, so no new matching process information is created. The corresponding keyword "代??开*发票" already has matching process information 2, so the index bit of '票' in the text to be matched, 10, is recorded in matching process information 2. The updated matching process information table is shown in Table 9:

Keyword	Matching process information	Process information number
找??代理	index[]: [1, 4, 5]	1
代??开*发票	index[]: [4, 6, 9, 10]	2

Table 9

Matching process information 2 is checked. The first distance between the first and second elements of index[] is 6 - 4 - 1 = 1. For the keyword "代??开*发票" corresponding to matching process information 2, the character distance rule set is [4, 2, -1, 0]. The rule between '代' and '开' is 2, that is, the second distance equals 2, so the distance rule for '开' is satisfied. The first distance between the second and third elements is 9 - 6 - 1 = 2; the rule between '开' and '发' is -1, meaning that the second distance is an interval of any length, so the distance rule for '发' is satisfied. The first distance between the third and fourth elements is 10 - 9 - 1 = 0, which satisfies the distance rule of 0 for '票'. Since no element of the matching process information is -1, the matching is completed, and the string between the first and last recorded index bits (index bits 4 to 10), "agent opening business invoice", is hit.
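The whole walkthrough can be reproduced with the simplified sketch below. It is an illustration under stated assumptions, not the claimed implementation: the wildcards are written as ASCII '?' (one permitted character) and '*' (any interval), the text uses the placeholder 'x' for every character that belongs to no keyword, the copy made for a repeated character is omitted, and all function names are hypothetical.

```python
ANY_INTERVAL = -1  # distance rule value: an interval of any length
INITIAL = -1       # matching process information value: character not seen yet

def compile_keyword(pattern):
    """Split a pattern into its characters and the permitted gap before each one."""
    chars, gaps, gap = [], [], 0
    for ch in pattern:
        if ch == "?":
            gap = gap if gap == ANY_INTERVAL else gap + 1
        elif ch == "*":
            gap = ANY_INTERVAL
        else:
            chars.append(ch)
            gaps.append(gap)
            gap = 0
    return chars, gaps

def fuzzy_match(text, patterns):
    compiled = {p: compile_keyword(p) for p in patterns}
    inverted = {}  # character -> list of (keyword, position in keyword)
    for p, (chars, _) in compiled.items():
        for pos, ch in enumerate(chars):
            inverted.setdefault(ch, []).append((p, pos))

    active = []  # matching process information set: [keyword, info] entries
    hits = []
    for i, ch in enumerate(text):
        for p, pos in inverted.get(ch, []):
            chars, gaps = compiled[p]
            updated = []
            if pos == 0:
                # First character: create new matching process information.
                entry = [p, [i] + [INITIAL] * (len(chars) - 1)]
                active.append(entry)
                updated.append(entry)
            else:
                for entry in list(active):
                    info = entry[1]
                    if entry[0] != p or info[pos - 1] == INITIAL or info[pos] != INITIAL:
                        continue
                    first = i - info[pos - 1] - 1  # first distance
                    if gaps[pos] == ANY_INTERVAL or gaps[pos] >= first:
                        info[pos] = i
                        updated.append(entry)
                    else:
                        # Distance check failed: invalidate this record.
                        active = [e for e in active if e is not entry]
            for entry in updated:
                if INITIAL not in entry[1]:
                    hits.append((entry[0], text[entry[1][0]:i + 1]))
                    active = [e for e in active if e is not entry]
    return hits

text = "x找xx代理开xx发票"  # same index bits as the example: 找=1, 代=4, 理=5, 开=6, 发=9, 票=10
print(fuzzy_match(text, ["找??代理", "代??开*发票"]))
# [('找??代理', '找xx代理'), ('代??开*发票', '代理开xx发票')]
```

The two hits span index bits 1 to 5 and 4 to 10, matching the strings obtained from Table 6 and Table 9.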

On the other hand, the same character may occur several times in the text to be matched while belonging to a keyword. An example with two identical characters follows.

For example, the keyword is "代??开*发票", and the text to be matched is "Beijing agent opens a business to open a tax invoice".

When the first '开' ("open") in the text to be matched is scanned, the matching process information is updated: its second element is set to 4, giving the matching process information array [2, 4, -1, -1]. When the second '开' in the text to be matched is scanned, the second element of the existing matching process information is found to already hold a value (not the initial value -1). A copy of the matching process information is then made, and the array of the copy is [2, 7, -1, -1].
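The copy made for a repeated character can be sketched as follows, using the values of this example. The function name `record_or_copy` is hypothetical, and the sketch assumes that the copy keeps all other elements unchanged, which the example does not need to decide because the later elements are still -1.

```python
INITIAL = -1  # assumed initial value

def record_or_copy(info_set, info, pos, index_in_text):
    """Record an index bit, copying the record first if the element is already set."""
    if info[pos] != INITIAL:
        copied = list(info)            # keep the existing record untouched
        copied[pos] = index_in_text
        info_set.append(copied)
        return copied
    info[pos] = index_in_text
    return info

info_set = [[2, -1, -1, -1]]                   # '代' was scanned at index bit 2
record_or_copy(info_set, info_set[0], 1, 4)    # first '开' at index bit 4
record_or_copy(info_set, info_set[0], 1, 7)    # second '开' at index bit 7
print(info_set)                                # [[2, 4, -1, -1], [2, 7, -1, -1]]
```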

On the other hand, a character in the text to be matched may belong to a keyword only as a non-first character of that keyword. For example, the keywords are "代??开*发票" and "开*票" ("open * ticket"), and the text to be matched is "Beijing Opens a Tax Invoice".

When '开' ("open") in the text to be matched is scanned, it is found not to be the first character of the keyword "代??开*发票"; since no record exists for that keyword, the keyword is ignored. Because '开' is the first character of the keyword "开*票", only the keyword "开*票" is matched against the characters that follow '开' in the text to be matched.

In this embodiment, the keywords are split, a matching rule is established for each character, and an inverted index table is created. When the text to be matched is processed, keywords whose first character does not appear in the text to be matched are filtered out, so it is not necessary to exhaust all keywords, which simplifies keyword matching. Checking the matching process information avoids backtracking and keyword invalidation during matching, and improves matching efficiency.

This embodiment provides a device for multi-keyword fuzzy matching. As shown in FIG. 3, the device includes: an obtaining module 301, a determining module 302, a first marking module 303, a second marking module 304, and a determining module 305.

The obtaining module 301 is configured to obtain, for each character in the text to be matched and according to the keyword set, the keyword to which the character belongs and the index bit of the character in that keyword;

The determining module 302 is configured to determine, according to the index bit of the character in the keyword to which it belongs, whether the character is the first character of that keyword;

The first marking module 303 is configured to: when the determination result of the determining module 302 is yes, record the keyword to which the character belongs in the matching information set, and mark in the record that the first character of the keyword exists in the text to be matched;

The second marking module 304 is configured to: when the determination result of the determining module 302 is negative and a record of the keyword to which the character belongs exists in the matching information set, obtain that record and mark in it that the character of the keyword exists in the text to be matched;

The determining module 305 is configured to determine that the to-be-matched text hits a keyword when every character of the keyword is marked as existing in the to-be-matched text.

In a preferred embodiment of the present application, the apparatus may further include: a matching rule building module, a matching rule set building module, and an inverted index building module.

The matching rule construction module is configured to construct, for each character of each keyword in the keyword set, a matching rule corresponding to that character; the matching rule includes: the character, the keyword to which the character belongs, the number of characters included in that keyword, and the index bit of the character in that keyword;

The matching rule set construction module is configured to acquire a matching rule corresponding to each character in the keyword, and form a matching rule set corresponding to the keyword;

The inverted index construction module is configured to construct an inverted index table from characters to matching rules according to the matching rule set; the inverted index table includes: a character, and all matching rules corresponding to the character.
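As an illustration of these three modules, the sketch below builds matching rules and an inverted index table for two keywords. The dictionary layout, the field names, and the use of ASCII '?' and '*' as wildcards are assumptions made for the example.

```python
def split_keyword(pattern):
    """Return the characters of a keyword, skipping the assumed wildcards '?' and '*'."""
    return [ch for ch in pattern if ch not in "?*"]

def build_inverted_index(patterns):
    """Map each character to all matching rules in which it occurs."""
    index = {}
    for pattern in patterns:
        chars = split_keyword(pattern)
        for pos, ch in enumerate(chars):
            rule = {
                "char": ch,            # the character
                "keyword": pattern,    # the keyword to which it belongs
                "length": len(chars),  # number of characters in that keyword
                "pos": pos,            # index bit of the character in the keyword
            }
            index.setdefault(ch, []).append(rule)
    return index

index = build_inverted_index(["代??开*发票", "找??代理"])
for rule in index["代"]:
    print(rule["keyword"], rule["pos"])
# 代??开*发票 0
# 找??代理 1
```

A character that appears in no keyword has no entry in the table, so it can be skipped with a single lookup.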

In the preferred embodiment of the present application, the obtaining module 301 may include: a traversal unit and a first acquiring unit.

The traversing unit is configured to traverse the inverted index table, and when the inverted index table includes the character, acquire all matching rules corresponding to the character;

The first obtaining unit is configured to acquire, for each matching rule, the keyword to which the character belongs, as included in the matching rule, and the index bit of the character in that keyword.

In a preferred embodiment of the present application, the first marking module 303 may include: a first recording unit.

The first recording unit is configured to: when the determination result of the determining module 302 is yes, create matching process information corresponding to the keyword to which the character belongs, save the matching process information in the matching process information set, and record the index bit of the character in the text to be matched into the matching process information.

In a preferred embodiment of the present application, the second marking module 304 may include: a second recording unit.

The second recording unit is configured to: when the determination result of the determining module 302 is negative, search the matching process information set and determine whether it contains matching process information corresponding to the keyword to which the character belongs; if so, acquire that matching process information and record the index bit of the character in the to-be-matched text into it.

In a preferred embodiment of the present application, the second recording unit may further include: a determining subunit, a copy updating subunit, and an index bit recording subunit.

The determining subunit is configured to determine whether an index bit of the character in the to-be-matched text has been recorded in the matching process information corresponding to the keyword to which the character belongs;

The copy update subunit is configured to: when the determination result of the determining subunit is yes, copy the matching process information corresponding to the keyword to which the character belongs, and replace the index bit recorded for the character in the copy with the current index bit of the character in the to-be-matched text;

The index bit recording subunit is configured to: when the determination result of the determining subunit is negative, record the index bit of the character in the to-be-matched text into the matching process information corresponding to the keyword to which the character belongs.

In a preferred embodiment of the present application, each character of the keyword being marked in the to-be-matched text means that each element of the matching process information corresponding to the keyword is set to the index bit, in the text to be matched, of the keyword character at the corresponding position.

In a preferred embodiment of the present application, the apparatus may further include: an output module.

The output module is configured to: after the determining module determines that the to-be-matched text hits the keyword, obtain matching information from the to-be-matched text according to an index bit of each character in the text to be matched, and output the Match information.

In a preferred embodiment of the present application, the apparatus may further include: a character distance construction module and a matching verification module.

The character distance construction module is configured to construct a character distance rule for each keyword in the keyword set, so as to form a character distance rule set, where the character distance rule includes: the number of characters included in the keyword, and the effective distance between each character in the keyword and its previous character;

The matching check module is configured to perform, according to the character distance rule set, a distance matching check on the keyword to which the current character belongs when the text to be matched is matched.
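One way such a character distance rule could be derived from a keyword is sketched below. The reading of the wildcards is an assumption inferred from the worked example of the method embodiment: each '?' permits one more character between two keyword characters, and '*' permits an interval of any length, written as -1. The function name `build_distance_rule` is hypothetical.

```python
ANY_INTERVAL = -1  # effective distance meaning "an interval of any length"

def build_distance_rule(pattern):
    """Return [character count, effective distance to the previous character, ...]."""
    count = 0   # number of keyword characters seen so far
    gaps = []   # effective distance between each character and its previous one
    gap = 0
    for ch in pattern:
        if ch == "?":
            if gap != ANY_INTERVAL:
                gap += 1
        elif ch == "*":
            gap = ANY_INTERVAL
        else:
            if count > 0:
                gaps.append(gap)
            count += 1
            gap = 0
    return [count] + gaps

print(build_distance_rule("找??代理"))      # [3, 2, 0]
print(build_distance_rule("代??开*发票"))   # [4, 2, -1, 0]
```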

In a preferred embodiment of the present application, the matching verification module may include: a second obtaining unit, a third obtaining unit, a first checking unit, and a second checking unit.

The second obtaining unit is configured to obtain, from the matching process information corresponding to the keyword to which the current character belongs, the index bit of the previous character of the current character in the to-be-matched text, and to calculate a first distance between the current character and its previous character from that index bit and the index bit of the current character in the to-be-matched text;

The third obtaining unit is configured to obtain, as a second distance, an effective distance between the current character and a previous character of the current character from a character distance rule corresponding to a keyword to which the current character belongs;

The first checking unit is configured to: if the second distance represents an interval of any length, or the second distance is greater than or equal to the first distance, determine that the distance check succeeds and acquire the next character of the current character for matching;

The second checking unit is configured to: if the second distance does not represent an interval of any length and the second distance is smaller than the first distance, determine that the distance check fails, invalidate the matching process information, and end the matching of the keyword to which the character belongs.

With regard to the apparatus in the above embodiments, the specific manner in which the respective modules perform the operations has been described in detail in the embodiment relating to the method, and will not be explained in detail herein.

FIG. 4 is a block diagram of an apparatus 800 for keyword fuzzy matching, according to an exemplary embodiment. For example, device 800 can be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a gaming console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.

Referring to FIG. 4, apparatus 800 can include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and a communication component 816.

Processing component 802 typically controls the overall operation of device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing component 802 can include one or more processors 820 to execute instructions to perform all or part of the steps of the above described methods. Moreover, processing component 802 can include one or more modules to facilitate interaction between component 802 and other components. For example, processing component 802 can include a multimedia module to facilitate interaction between multimedia component 808 and processing component 802.

Memory 804 is configured to store various types of data to support operation at device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phone book data, messages, pictures, videos, and the like. The memory 804 can be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), erasable programmable read only memory (EPROM), programmable read only memory (PROM), read only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.

Power component 806 provides power to various components of device 800. Power component 806 can include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for device 800.

The multimedia component 808 includes a screen between the device 800 and the user that provides an output interface. In some embodiments, the screen can include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen can be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may sense not only the boundary of the touch or sliding action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front and rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.

The audio component 810 is configured to output and/or input an audio signal. For example, the audio component 810 includes a microphone (MIC) that is configured to receive external audio signals when the device 800 is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signal may be further stored in memory 804 or transmitted via communication component 816. In some embodiments, the audio component 810 also includes a speaker for outputting an audio signal.

The I/O interface 812 provides an interface between the processing component 802 and the peripheral interface module, which may be a keyboard, a click wheel, a button, or the like. These buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.

Sensor component 814 includes one or more sensors for providing device 800 with status assessments of various aspects. For example, sensor component 814 can detect an open/closed state of device 800 and the relative positioning of components, such as the display and keypad of device 800. Sensor component 814 can also detect a change in position of device 800 or of one component of device 800, the presence or absence of user contact with device 800, the orientation or acceleration/deceleration of device 800, and temperature variation of device 800. Sensor component 814 can include a proximity sensor configured to detect the presence of nearby objects without any physical contact. Sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 can also include an acceleration sensor, a gyro sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

Communication component 816 is configured to facilitate wired or wireless communication between device 800 and other devices. The device 800 can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives broadcast signals or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 also includes a near field communication (NFC) module to facilitate short range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, device 800 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above methods.

In an exemplary embodiment, there is also provided a non-transitory computer readable storage medium comprising instructions, such as a memory 804 comprising instructions executable by processor 820 of apparatus 800 to perform the above method. For example, the non-transitory computer readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, or an optical data storage device.

A non-transitory computer readable storage medium, wherein instructions in the storage medium, when executed by a processor of a mobile terminal, enable the mobile terminal to perform a method for keyword fuzzy matching, the method comprising: for each character in the text to be matched, acquiring, according to the keyword set, the keyword to which the character belongs and the index bit of the character in that keyword; determining, according to the index bit of the character in the keyword, whether the character is the first character of the keyword; if the character is the first character of the keyword, recording the keyword to which the character belongs in the matching information set, and marking in the record that the first character of the keyword exists in the text to be matched; if the character is not the first character of the keyword, and a record of the keyword to which the character belongs exists in the matching information set, acquiring the record of the keyword to which the character belongs, and marking in the record that the character in the keyword exists in the text to be matched; and when every character in a keyword is marked as existing in the text to be matched, determining that the text to be matched hits the keyword.
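The steps recited above can be illustrated with a minimal Python sketch. It is not the implementation of the embodiments: the function names `build_index` and `fuzzy_match`, the use of a list of offsets as the "record", and the decision to keep several open records per keyword are all assumptions made for illustration. The sketch follows the steps literally, so it does not enforce character order or any distance constraint, and it lets a single text character fill at most one slot of a record.

```python
from collections import defaultdict


def build_index(keywords):
    """Map each character to the (keyword, index bit) pairs it appears in."""
    index = defaultdict(list)
    for keyword in keywords:
        for position, char in enumerate(keyword):
            index[char].append((keyword, position))
    return index


def fuzzy_match(text, keywords):
    """Return {keyword: [text offsets]} for every keyword the text hits."""
    index = build_index(keywords)
    # Matching information set (assumed shape): keyword -> list of open records.
    # Each record has one slot per keyword character; None means "not marked".
    records = defaultdict(list)
    hits = {}
    for offset, char in enumerate(text):
        for keyword, position in index.get(char, ()):
            if position == 0:
                # First character of the keyword: open a record, mark slot 0.
                slots = [None] * len(keyword)
                slots[0] = offset
                records[keyword].append(slots)
            else:
                # Later character: mark it in every open record of the keyword.
                for slots in records[keyword]:
                    if offset not in slots:
                        slots[position] = offset
            # A record whose slots are all marked means the text hits the keyword.
            for slots in records[keyword]:
                if None not in slots and keyword not in hits:
                    hits[keyword] = list(slots)
    return hits
```

Calling `fuzzy_match("a-b-c", ["abc"])` reports a hit for "abc" together with the offsets of its characters in the text, while a text that never contains the first character of a keyword opens no record and therefore cannot hit it.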

FIG. 5 is a schematic structural diagram of a server in an embodiment of the present invention. The server 1900 can vary considerably depending on configuration or performance, and can include one or more central processing units (CPUs) 1922 (e.g., one or more processors), memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. The memory 1932 and the storage medium 1930 may be short-term storage or persistent storage. The program stored on storage medium 1930 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, central processor 1922 can be configured to communicate with storage medium 1930 and to execute, on server 1900, the series of instruction operations in storage medium 1930.

The embodiment of the present invention further provides a program comprising readable code which, when run on a server, causes the server to perform the keyword fuzzy matching method according to any one of the embodiments of the present invention. Also provided is a readable medium in which the program described in an embodiment of the present invention is stored.

For example, FIG. 6 shows a server that can implement the keyword fuzzy matching method according to the present invention. The server conventionally includes a processor 1610 and a program product or readable medium in the form of a memory 1620. The memory 1620 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), an EPROM, or a ROM. Memory 1620 has a storage space 1630 for program code 1631 for performing any of the method steps described above. For example, the storage space 1630 for program code may include individual program codes 1631 for implementing the various steps of the above methods, respectively. These program codes can be read from or written to one or more program products. These program products include program code carriers such as memory cards. Such a program product is typically a portable or fixed storage unit as described with reference to FIG. The storage unit may have a storage section, a storage space, and the like arranged similarly to the memory 1620 in the server of FIG. The program code can be compressed, for example, in an appropriate form. Typically, the storage unit includes readable code 1631', i.e., code that can be read by a processor such as 1610, which when run by the server causes the server to perform the various steps of the methods described above.

Server 1900 may also include one or more power sources 1926, one or more wired or wireless network interfaces 1950, one or more input and output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.

Other embodiments of the invention will be apparent to those skilled in the art. The present invention is intended to cover any variations, uses, or adaptations of the present invention that follow the general principles of the invention and include common general knowledge or customary technical means in the art that are not disclosed in the present disclosure. The specification and examples are to be considered as illustrative only.

It is to be understood that the invention is not limited to the precise details described above. The scope of the invention is limited only by the appended claims.

The above are only the preferred embodiments of the present invention, and are not intended to limit the present invention. Any modifications, equivalents, improvements, etc., which are within the spirit and principles of the present invention, should be included within the scope of protection of the present invention.

Claims

A method for keyword fuzzy matching, comprising:

For each character in the text to be matched, acquiring, according to the keyword set, the keyword to which the character belongs and the index bit of the character in that keyword;

Determining, according to the index bit of the character in the keyword to which it belongs, whether the character is the first character of the keyword;

If the character is the first character of the keyword, recording the keyword to which the character belongs in the matching information set, and marking in the record that the first character of the keyword exists in the text to be matched;

If the character is not the first character of the keyword, and a record of the keyword to which the character belongs exists in the matching information set, acquiring the record of the keyword to which the character belongs, and marking in the record that the character in the keyword exists in the text to be matched;

When every character in a keyword is marked as existing in the text to be matched, determining that the text to be matched hits the keyword.
The method of claim 1 wherein the method further comprises:

For each character of each keyword in the keyword set, separately constructing a matching rule corresponding to the character; the matching rule includes: the character, the keyword to which the character belongs, the number of characters included in the keyword to which the character belongs, and the index bit of the character in the keyword to which it belongs;

Obtaining the matching rule corresponding to each character in the keyword to form a matching rule set corresponding to the keyword;

Constructing, according to the matching rule set, an inverted index table from characters to matching rules; the inverted index table includes: a character, and all matching rules corresponding to the character.
The method according to claim 2, wherein acquiring, according to the keyword set, the keyword to which the character belongs and the index bit of the character in that keyword comprises:

Traversing the inverted index table, and when the inverted index table includes the character, acquiring all matching rules corresponding to the character;

For each matching rule, acquiring the keyword to which the character belongs and the index bit of the character in that keyword, as contained in the matching rule.
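As an illustration only, the matching rules and the inverted index table described in the two preceding claims might be represented as in the following Python sketch. The names `MatchRule`, `build_inverted_index`, and `lookup` are hypothetical, and the table is modeled as a plain dictionary, which is an assumption rather than a structure stated in the claims.

```python
from collections import defaultdict
from typing import NamedTuple


class MatchRule(NamedTuple):
    char: str      # the character
    keyword: str   # keyword to which the character belongs
    length: int    # number of characters in that keyword
    position: int  # index bit of the character in the keyword


def build_inverted_index(keywords):
    """Build a table from each character to all of its matching rules."""
    index = defaultdict(list)
    for keyword in keywords:
        for position, char in enumerate(keyword):
            index[char].append(MatchRule(char, keyword, len(keyword), position))
    return dict(index)


def lookup(index, char):
    """Return (keyword, index bit) for every matching rule of the character."""
    return [(rule.keyword, rule.position) for rule in index.get(char, [])]
```

With the keywords "abc" and "cab", looking up the character "c" yields one rule per keyword, each carrying the position of "c" in that keyword; a character that appears in no keyword yields an empty list, so it can be skipped without further work.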
The method according to claim 1, wherein, if the character is the first character of the keyword, recording the keyword to which the character belongs in the matching information set and marking in the record that the first character of the keyword exists in the text to be matched comprises:

If the character is the first character of the keyword, newly creating matching process information corresponding to the keyword to which the character belongs, and saving the matching process information in the matching information set; and recording the index bit of the character in the text to be matched into the matching process information.
The method according to claim 4, wherein, if the character is not the first character of the keyword and a record of the keyword to which the character belongs exists in the matching information set, acquiring the record of the keyword to which the character belongs and marking in the record that the character in the keyword exists in the text to be matched comprises:

If the character is not the first character of the keyword, searching the matching information set and determining whether matching process information corresponding to the keyword to which the character belongs exists in the matching information set; if yes, acquiring the matching process information corresponding to the keyword to which the character belongs, and recording the index bit of the character in the text to be matched into the matching process information.
The method according to claim 5, wherein, when the matching process information corresponding to the keyword to which the character belongs exists in the matching information set, the method further includes:

Determining whether an index bit of the character in the text to be matched has already been recorded in the matching process information corresponding to the keyword to which the character belongs;

If yes, copying the matching process information corresponding to the keyword to which the character belongs, and updating the index bit of the character already recorded in the matching process information with the current index bit of the character in the text to be matched;

If not, performing the step of acquiring the matching process information corresponding to the keyword to which the character belongs and recording the index bit of the character in the text to be matched into the matching process information.
The method according to claim 4, wherein every character in the keyword being marked as existing in the text to be matched comprises:

Each bit in the matching process information corresponding to the keyword to which the character belongs is set to the index bit, in the text to be matched, of the character corresponding to that bit.
The method according to claim 7, wherein after determining that the text to be matched hits the keyword, the method further comprises: obtaining matching information from the text to be matched according to the index bit of each character in the text to be matched, and outputting the matching information.
The method of claim 1 further comprising:

Constructing a character distance rule for each keyword in the keyword set to form a character distance rule set; the character distance rule includes: the number of characters included in the keyword, and the effective distance between each character in the keyword and its previous character;

When the text to be matched is matched, performing a distance check on the keyword to which the current character belongs according to the character distance rule set.
The method according to claim 9, wherein performing the distance check on the keyword to which the current character belongs according to the character distance rule set comprises:

Obtaining, for the keyword to which the current character belongs, the index bit of the previous character of the current character in the text to be matched, and calculating a first distance between the current character and its previous character according to the index bit of the previous character in the text to be matched and the index bit of the current character in the text to be matched;

Obtaining, from the character distance rule corresponding to the keyword to which the current character belongs, the effective distance between the current character and its previous character as a second distance;

If the second distance indicates any interval length, or the second distance is greater than the first distance, the distance check succeeds, and the next character after the current character is obtained for matching;

If the second distance does not indicate any interval length, and the second distance is smaller than the first distance, the distance check fails, the matching process information is invalidated, and matching of the keyword to which the character belongs ends.
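The distance check of the preceding claim can be sketched in Python as follows. This is an illustrative sketch under several assumptions that are not stated in the claims: the first distance is taken to be the difference between the two index bits in the text, "any interval length" is represented by the hypothetical marker `ANY`, a character distance rule is modeled as a list with one entry per keyword character, and the case where the two distances are equal, which the claim leaves open, is treated as a success.

```python
ANY = None  # hypothetical marker meaning "any interval length"


def check_distance(rule_set, keyword, position, prev_offset, cur_offset):
    """Return True if the current character passes the distance check.

    rule_set[keyword][position] is the effective distance between the
    character at `position` in the keyword and its previous character
    (entry 0 is unused, because the first character has no predecessor).
    """
    first_distance = cur_offset - prev_offset       # measured in the text
    second_distance = rule_set[keyword][position]   # allowed by the rule
    if second_distance is ANY:
        return True
    # Assumption of this sketch: equal distances count as a success.
    return second_distance >= first_distance
```

For a rule set such as `{"abc": [ANY, 2, ANY]}`, the character "b" passes the check only when it lies at most two positions after the matched "a", whereas "c" may follow "b" at any distance. When the check fails, the caller would discard the corresponding matching process information.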
A device for fuzzy matching of keywords, comprising:

An obtaining module, configured to acquire, for each character in the text to be matched and according to the keyword set, the keyword to which the character belongs and the index bit of the character in that keyword;

a judging module, configured to determine, according to the index bit of the character in the keyword, whether the character is the first character of the keyword;

a first marking module, configured to: when the result of the judging module is yes, record the keyword to which the character belongs in the matching information set, and mark in the record that the first character of the keyword exists in the text to be matched;

a second marking module, configured to: when the result of the judging module is no and a record of the keyword to which the character belongs exists in the matching information set, acquire the record of the keyword to which the character belongs, and mark in the record that the character in the keyword exists in the text to be matched;

and a determining module, configured to: when every character in a keyword is marked as existing in the text to be matched, determine that the text to be matched hits the keyword.
An apparatus for keyword fuzzy matching, comprising: a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include instructions for performing the following operations:

For each character in the text to be matched, acquiring, according to the keyword set, the keyword to which the character belongs and the index bit of the character in that keyword;

Determining, according to the index bit of the character in the keyword to which it belongs, whether the character is the first character of the keyword;

If the character is the first character of the keyword, recording the keyword to which the character belongs in the matching information set, and marking in the record that the first character of the keyword exists in the text to be matched;

If the character is not the first character of the keyword, and a record of the keyword to which the character belongs exists in the matching information set, acquiring the record of the keyword to which the character belongs, and marking in the record that the character in the keyword exists in the text to be matched;

When every character in a keyword is marked as existing in the text to be matched, determining that the text to be matched hits the keyword.
A program comprising readable code that, when run on a server, causes the server to perform a method of keyword fuzzy matching according to any of claims 1-10.
A readable medium storing the program of claim 13.