CN111400563A - Pattern matching method and device for pattern matching - Google Patents

Pattern matching method and device for pattern matching

Info

Publication number
CN111400563A
Authority
CN
China
Prior art keywords
string
pattern
matching
bit vector
vector table
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010183402.5A
Other languages
Chinese (zh)
Other versions
CN111400563B (en)
Inventor
孙浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN202010183402.5A priority Critical patent/CN111400563B/en
Publication of CN111400563A publication Critical patent/CN111400563A/en
Application granted granted Critical
Publication of CN111400563B publication Critical patent/CN111400563B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Embodiments of the invention provide a pattern matching method, a pattern matching apparatus, and a device for pattern matching. The method comprises: performing word segmentation on a pattern string to be matched to obtain a first participle set corresponding to the pattern string; encoding each participle in the first participle set to obtain an encoded pattern string; allocating a bit vector table corresponding to the encoded pattern string, wherein bits in the bit vector table indicate whether the corresponding participles appear in the pattern string; and matching the pattern string against the text string to be matched based on a bit-parallel algorithm according to the bit vector table. Embodiments of the invention can improve the utilization of the bit vector table (B table) and the efficiency of pattern matching.

Description

Pattern matching method and device for pattern matching
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a pattern matching method and apparatus, and an apparatus for pattern matching.
Background
String matching, also known as pattern matching, is a key technology widely used in information retrieval, intrusion detection, computational biology, search engines, data compression, and similar fields. Pattern matching refers to finding all occurrences of a pattern string P = p1p2...pm in a text string T = t1t2...tn.
Bit-parallel algorithms are among the pattern matching algorithms in common use today; they include shift-and, shift-or, and BNDM (Backward Nondeterministic DAWG Matching). A bit-parallel algorithm generally maintains a bit vector table, referred to as the B table, in the computer cache. The B table can be understood as an n × m 0/1 matrix whose entries indicate whether the corresponding character appears in the pattern string.
However, each bit in the B table represents a character, so the B table occupies a large amount of memory. Furthermore, because of the particularities of semantics, not every match is meaningful in practice. For example, take the Chinese text string meaning "why is Liu Dehua so handsome" and the pattern string "Huawei". At the character level the match succeeds, because in the Chinese original the last character of the name "Liu Dehua" followed by the first character of "why" spells "Huawei". The match is nevertheless meaningless, because the text string has no semantic association with "Huawei".
Disclosure of Invention
Embodiments of the invention provide a pattern matching method, a pattern matching apparatus, and a device for pattern matching, which can improve the utilization of the B table and the efficiency of pattern matching.
In order to solve the above problem, an embodiment of the present invention discloses a pattern matching method, where the method includes:
performing word segmentation on a pattern string to be matched to obtain a first participle set corresponding to the pattern string;
encoding each participle in the first participle set to obtain an encoded pattern string;
allocating a bit vector table corresponding to the encoded pattern string, wherein bits in the bit vector table indicate whether the corresponding participles appear in the pattern string;
and matching the pattern string against the text string to be matched based on a bit-parallel algorithm according to the bit vector table.
In another aspect, an embodiment of the present invention discloses a pattern matching apparatus, including:
a first word segmentation module, configured to perform word segmentation on the pattern string to be matched to obtain a first participle set corresponding to the pattern string;
a first encoding module, configured to encode each participle in the first participle set to obtain an encoded pattern string;
an allocation module, configured to allocate a bit vector table corresponding to the encoded pattern string, wherein bits in the bit vector table indicate whether the corresponding participles appear in the pattern string;
and a pattern matching module, configured to match the pattern string against the text string to be matched based on a bit-parallel algorithm according to the bit vector table.
In yet another aspect, an embodiment of the present invention discloses an apparatus for pattern matching, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
performing word segmentation on a pattern string to be matched to obtain a first participle set corresponding to the pattern string;
encoding each participle in the first participle set to obtain an encoded pattern string;
allocating a bit vector table corresponding to the encoded pattern string, wherein bits in the bit vector table indicate whether the corresponding participles appear in the pattern string;
and matching the pattern string against the text string to be matched based on a bit-parallel algorithm according to the bit vector table.
In yet another aspect, embodiments of the invention disclose a machine-readable medium having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform a pattern matching method as described in one or more of the preceding.
The embodiment of the invention has the following advantages:
Before pattern matching, word segmentation is performed on the pattern string to obtain a first participle set corresponding to the pattern string, and each participle in the first participle set is encoded to obtain an encoded pattern string. A bit vector table corresponding to the encoded pattern string is allocated, in which bits indicate whether the corresponding participles appear in the pattern string. The pattern string is then matched against the text string to be matched based on a bit-parallel algorithm according to the bit vector table.
First, because the embodiment of the invention segments the pattern string before encoding it, it can not only correct semantic errors and filter out meaningless matching results, but also compress the value range length of the B table through the post-segmentation encoding, which reduces the cache space occupied and improves the utilization of the B table.
In addition, the embodiment of the invention selects the target substring according to the word frequency of the substrings, which reduces the probability that collision pattern strings arise when the target substring is hit and thereby improves the efficiency of subsequent queries.
Furthermore, when the target substring has collision pattern strings, the query order of the collision pattern strings is determined by their word frequency, which keeps the number of queries as small as possible and improves query efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a flow chart of the steps of one embodiment of a pattern matching method of the present invention;
FIG. 2 is a block diagram of a pattern matching apparatus according to an embodiment of the present invention;
FIG. 3 is a block diagram of an apparatus 800 for pattern matching of the present invention; and
FIG. 4 is a schematic diagram of a server in some embodiments of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Method embodiment
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a pattern matching method of the present invention is shown, which may specifically include the following steps:
step 101, performing word segmentation on a pattern string to be matched to obtain a first participle set corresponding to the pattern string;
step 102, encoding each participle in the first participle set to obtain an encoded pattern string;
step 103, allocating a bit vector table corresponding to the encoded pattern string, wherein bits in the bit vector table indicate whether the corresponding participles appear in the pattern string;
and step 104, matching the pattern string against the text string to be matched based on a bit-parallel algorithm according to the bit vector table.
The pattern matching method of the embodiment of the invention can be applied to electronic devices, including but not limited to a server, a smart phone, a recording pen, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, a vehicle-mounted computer, a desktop computer, a set-top box, a smart television, a wearable device, and the like.
The pattern matching method of the embodiment of the invention can be used for single-pattern matching or multi-pattern matching. Single-pattern matching refers to matching one pattern string in a text string; multi-pattern matching refers to matching multiple pattern strings in one text string.
Before matching the text string and the pattern string, the embodiment of the invention performs word segmentation processing on the pattern string to obtain a first word segmentation set corresponding to the pattern string.
In a Chinese-language example of the embodiment of the present invention, assume that the text string to be matched is "Liu Dehua why so handsome" (a word-for-word rendering of the Chinese for "why is Liu Dehua so handsome") and that there are two pattern strings to be matched: "why so handsome" and "Huawei". First, the two pattern strings are each segmented, and the first participle set {"why", "so", "handsome", "Huawei"} is obtained from the segmentation results. Each participle in the first participle set is then encoded to obtain the encoded pattern strings. For example, if "why" is encoded as C, "so" as D, "handsome" as E, and "Huawei" as F, the two encoded pattern strings are CDE and F.
In an optional embodiment of the present invention, before the step 104 matches the pattern string and the text string to be matched based on a bit parallel algorithm according to the bit vector table, the method may further include:
step S11, performing word segmentation on the text string to obtain a second word segmentation set corresponding to the text string;
and step S12, coding each participle in the second participle set to obtain a coded text string.
After the pattern matching system reads in the text string, word segmentation can be performed on the text string to obtain a second participle set corresponding to the text string. For example, in the above example, the text string "Liu Dehua why so handsome" may be segmented to obtain the second participle set {"Liu Dehua", "why", "so", "handsome"}, and each participle in the second participle set may then be encoded to obtain an encoded text string. For example, if "why" is encoded as C, "so" as D, "handsome" as E, and "Liu Dehua" as A, the encoded text string is ACDE. The first participle set and the second participle set can use the same encoding rule, so that the same participle always corresponds to the same code.
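The shared encoding step can be sketched as follows. This is a minimal illustration that assumes segmentation has already been performed by some external segmenter; the participle labels are English stand-ins for the Chinese words of the example, and the names `CODEBOOK`, `OTHER`, and `encode` are hypothetical rather than taken from the embodiment.

```python
from typing import Dict, List

# Codes copied from the example in the text; "G" stands for any other participle.
CODEBOOK: Dict[str, str] = {
    "Liu Dehua": "A",
    "why": "C",
    "so": "D",
    "handsome": "E",
    "Huawei": "F",
}
OTHER = "G"


def encode(participles: List[str], codebook: Dict[str, str] = CODEBOOK) -> str:
    """Map each participle to its code; unknown participles share one code."""
    return "".join(codebook.get(p, OTHER) for p in participles)


# Hand-segmented inputs (assumed output of a word segmenter).
patterns = [["why", "so", "handsome"], ["Huawei"]]
text = ["Liu Dehua", "why", "so", "handsome"]

encoded_patterns = [encode(p) for p in patterns]
encoded_text = encode(text)
print(encoded_patterns, encoded_text)
```

Because the pattern strings and the text string go through the same codebook, a participle that occurs in both always receives the same code.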
Generally, a bit-parallel algorithm maintains a bit vector table, referred to as the B table, in the computer cache. The length of the B table does not exceed w, where w is the number of bits the computer processes at one time (one machine word, for example 64 bits). The B table may be an n × m 0/1 matrix whose entries indicate whether the character at the corresponding position appears in the pattern string, where n is the value range length of the pattern string (the number of distinct symbols that can occur) and m is the length of the pattern string.
Take a 64-bit bit vector table as an example. In single-pattern matching (the number n of pattern strings to be matched is 1), the pattern string can be allocated directly to the 64-bit table. In multi-pattern matching (n is greater than 1), the bit vector table must be divided into groups, and the sum of the bit lengths of all groups must not exceed w (64 bits). For a bit vector table with m groups, the n encoded pattern strings are first allocated to the m groups, and the bit operations of the shift-and algorithm are then executed on the pattern strings simultaneously across the groups of one machine word. Two cases can arise when allocating n encoded pattern strings to m groups: when n is less than or equal to m, each pattern string can be allocated its own group; when n is greater than m, several pattern strings may be allocated to the same group. For convenience of description, the embodiments of the present invention take the case where n is less than or equal to m as an example; when n is greater than m, the pattern strings in one group only need to be compared to determine which one is hit.
After step 102 encodes the participles in the first participle set to obtain the encoded pattern string, a bit vector table (B table) may be allocated for the encoded pattern string, so that the bits in the B table indicate whether the corresponding participle appears in the pattern string to be matched.
Step 104, matching the pattern string against the text string to be matched based on a bit-parallel algorithm according to the bit vector table, may specifically include: matching the encoded pattern string against the encoded text string based on a bit-parallel algorithm according to the bit vector table.
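One way to run shift-and on several pattern strings inside a single machine word is sketched below. It is an illustration under stated assumptions, not necessarily the layout used by the embodiment: each pattern string gets a group exactly as long as itself, the groups are laid back to back, and the names `pack_patterns` and `multi_shift_and` are hypothetical.

```python
from typing import Dict, List, Tuple


def pack_patterns(
    patterns: List[str], word_bits: int = 64
) -> Tuple[Dict[str, int], int, Dict[int, int]]:
    """Lay the patterns back to back in one word.

    Returns (B table, mask of group start bits, {accept bit: pattern index}).
    """
    if any(not p for p in patterns) or sum(len(p) for p in patterns) > word_bits:
        raise ValueError("patterns must be non-empty and fit in one machine word")
    table: Dict[str, int] = {}
    init = 0
    accept: Dict[int, int] = {}
    offset = 0
    for idx, pattern in enumerate(patterns):
        init |= 1 << offset  # first bit of this pattern's group
        for i, symbol in enumerate(pattern):
            table[symbol] = table.get(symbol, 0) | (1 << (offset + i))
        accept[1 << (offset + len(pattern) - 1)] = idx  # last bit of the group
        offset += len(pattern)
    return table, init, accept


def multi_shift_and(text: str, patterns: List[str]) -> List[Tuple[int, int]]:
    """Return (start position, pattern index) for every occurrence."""
    table, init, accept = pack_patterns(patterns)
    state = 0
    hits: List[Tuple[int, int]] = []
    for j, symbol in enumerate(text):
        # One shift, one OR and one AND advance every group at once.
        state = ((state << 1) | init) & table.get(symbol, 0)
        for bit, idx in accept.items():
            if state & bit:
                hits.append((j - len(patterns[idx]) + 1, idx))
    return hits


print(multi_shift_and("ACDE", ["CDE", "F"]))
```

A bit shifted out of one group lands on the first bit of the next group, which `init` sets on every step anyway, so the groups do not disturb each other.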
For example, in the example above, the problem of matching the pattern strings "why so handsome" and "Huawei" in the text string "Liu Dehua why so handsome" is converted into the problem of matching the encoded pattern strings "CDE" and "F" in the encoded text string "ACDE". Since "F" does not occur in "ACDE", the final matching result includes "why so handsome" but not "Huawei". Because the embodiment of the present invention encodes the segmented pattern strings, meaningless matching results can be filtered out.
In addition, because the embodiment of the present invention allocates the B table based on the segmented pattern strings, the value range length of the B table can be reduced and the utilization of the B table improved. For example, for the pattern strings "why so handsome" and "Huawei" above, a B table allocated in the conventional manner has one bit per character, and its value range length is typically on the order of a power of 2. In the embodiment of the present invention, however, the pattern strings are segmented; the participle "why" is encoded as C, "so" as D, "handsome" as E, and "Huawei" as F, and all participles that do not appear in the pattern strings are uniformly encoded as G. The value range length of the B table is therefore only 5, which is a large reduction and improves the utilization of the B table. Table 1 shows an example of the B table allocated to the pattern string "why so handsome" in an embodiment of the present invention.
TABLE 1
C 001
D 010
E 100
G 000
As shown in Table 1, the pattern string "why so handsome" is segmented and encoded to obtain the encoded pattern string "CDE", where "C" is the code corresponding to the participle "why". The first row of Table 1 is the bit mask "001" corresponding to the code "C". Each 0/1 in a bit mask indicates whether the character appears at the corresponding position of the pattern string. Note that, in the embodiment of the present invention, the order of positions in the bit mask is the reverse of the order of characters in the pattern string. For example, in "001" the 1st bit counting from the right is 1 and the remaining bits are 0, which indicates that the 1st character counting from the left of the encoded pattern string "CDE" is "C". Similarly, the second row is the bit mask "010" corresponding to the code "D": the 2nd bit from the right is 1 and the remaining bits are 0, which indicates that the 2nd character from the left of "CDE" is "D". The third row is the bit mask "100" corresponding to the code "E": the 3rd bit from the right is 1 and the remaining bits are 0, which indicates that the 3rd character from the left of "CDE" is "E". The bit mask "000" in the fourth row represents characters that do not appear in the encoded pattern string "CDE", that is, participles that do not appear in the pattern string "why so handsome".
It should be noted that encoding "why" as "C", "so" as "D", "handsome" as "E", "Huawei" as "F", and all other participles uniformly as "G" is only an application example of the embodiment of the present invention; the embodiment of the present invention does not limit the specific encoding manner.
For example, in multi-pattern matching with n pattern strings to be matched, the n pattern strings are first segmented to obtain a participle set. Each participle in the participle set is then encoded, with one code per participle. Suppose the participle set obtained by segmenting the n pattern strings contains 100 participles and the codes start from 0. The codes of the 100 participles then run from 0 to 99: the first participle has code 0, the second has code 1, and so on, up to the 100th participle with code 99, while any participle that is not among the 100 is encoded as 100. The n pattern strings to be matched are then encoded according to the codes of their participles, giving n encoded pattern strings. Finally, the B table is allocated for the n encoded pattern strings.
Because the embodiment of the invention segments the pattern string before encoding it, it can not only correct semantic errors and filter out meaningless matching results, but also compress the value range length of the B table through the post-segmentation encoding, which reduces the cache space occupied and improves the utilization of the B table.
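The bit masks described above can be reproduced with a few lines of code. This is a small sketch whose helper names `bit_masks` and `mask_str` are hypothetical; it stores position i of the encoded pattern string in bit i, so the printed mask reads right to left, as the text describes.

```python
from typing import Dict


def bit_masks(encoded_pattern: str) -> Dict[str, int]:
    """Bit i of masks[code] is set when encoded_pattern[i] == code."""
    masks: Dict[str, int] = {}
    for i, code in enumerate(encoded_pattern):
        masks[code] = masks.get(code, 0) | (1 << i)
    return masks


def mask_str(mask: int, width: int) -> str:
    """Render a mask as a fixed-width binary string, highest bit first."""
    return format(mask, f"0{width}b")


pattern = "CDE"
masks = bit_masks(pattern)
# "G" stands for every participle that does not occur in the pattern string.
for code in "CDEG":
    print(code, mask_str(masks.get(code, 0), len(pattern)))
```

Codes that are absent from the dictionary fall back to the all-zero mask, which plays the role of the fourth row.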
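The numbering scheme in this paragraph can be sketched as below. The function names `assign_codes` and `encode` are hypothetical, the input is assumed to be already segmented, and a four-participle set stands in for the 100-participle set of the text.

```python
from typing import Dict, List


def assign_codes(segmented_patterns: List[List[str]]) -> Dict[str, int]:
    """Number the distinct participles 0, 1, 2, ... in order of first appearance."""
    codes: Dict[str, int] = {}
    for participles in segmented_patterns:
        for participle in participles:
            codes.setdefault(participle, len(codes))
    return codes


def encode(participles: List[str], codes: Dict[str, int]) -> List[int]:
    """Participles outside the pattern strings all share the code len(codes)."""
    unknown = len(codes)
    return [codes.get(p, unknown) for p in participles]


segmented = [["why", "so", "handsome"], ["Huawei"]]
codes = assign_codes(segmented)
encoded = [encode(p, codes) for p in segmented]
value_range_length = len(codes) + 1  # known codes plus the shared "other" code
print(codes, encoded, value_range_length)
```

With 100 distinct participles the same code would yield the codes 0 to 99 and the shared code 100, so the B table needs 101 rows.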
In an optional embodiment of the present invention, before the step 103 allocates the bit vector table corresponding to the encoded pattern string, the method may further include:
step S21, when the length of the coded pattern string is larger than the preset grouping length, the coded pattern string is divided according to the grouping length to obtain each sub-string after division;
step S22, determining a target sub-string in each sub-string;
based on this, the step 103 of allocating the bit vector table corresponding to the encoded pattern string may specifically include: allocating a bit vector table corresponding to the target substring.
In a specific application, multi-pattern matching involves a plurality of pattern strings to be matched, so the B table usually has to be divided into groups, and the sum of the bit lengths of all groups must not exceed w (the number of bits the computer processes at one time, such as the machine word length). In a bit-parallel algorithm, when the length h of a pattern string is greater than the group length M of the B table, the pattern string has to be truncated. The group length M can be preset according to actual needs. A common practice is to take the substring formed by the first M characters, or by the last M characters, of the pattern string in place of the pattern string.
In an English-language example of the embodiment of the present invention, for the text string "ababfdeabc", assume that there are two pattern strings to be matched, "abcde" and "abfde", and that both are allocated to the same group of the B table, whose group length M is 2 bits. Since the lengths of "abcde" and "abfde" both exceed the group length, each must be truncated. If truncation keeps the first 2 characters, both pattern strings yield the substring "ab" as the target substring; that is, the target substring "ab" replaces the pattern string "abcde" and also replaces the pattern string "abfde". Only the B table corresponding to the target substring "ab" then needs to be allocated.
Optionally, the step 104 of matching the pattern string and the text string to be matched based on a bit parallel algorithm according to the bit vector table may specifically include:
step S31, matching the target substring with the text string based on a bit parallel algorithm according to a bit vector table corresponding to the target substring to obtain the position of the matching string in the text string;
and step S32, inquiring whether the text string hits the pattern string corresponding to the target sub-string according to the position of the matching string in the text string.
In the above example, the same target substring "ab" is used for both pattern strings s1 = "abcde" and s2 = "abfde", and the allocated B table is shown in Table 2.
TABLE 2
a 01
b 10
c 00
According to the B table shown in Table 2, the target substring "ab" is matched against the text string "ababfdeabc" based on a bit-parallel algorithm, yielding 3 occurrences of "ab" at positions pos = 0, pos = 2, and pos = 7 of the text string.
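The truncate-then-match step can be sketched with a plain shift-and loop. This is an illustration only: it assumes truncation keeps the first M characters, and the names `truncate`, `shift_and`, and `by_target` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def truncate(pattern: str, group_len: int) -> str:
    """Keep the first group_len characters as the target substring."""
    return pattern[:group_len]


def shift_and(text: str, pattern: str) -> List[int]:
    """Return the start position of every occurrence of pattern in text."""
    table: Dict[str, int] = {}
    for i, ch in enumerate(pattern):
        table[ch] = table.get(ch, 0) | (1 << i)  # bit i <-> pattern[i]
    accept = 1 << (len(pattern) - 1)
    state = 0
    hits: List[int] = []
    for j, ch in enumerate(text):
        state = ((state << 1) | 1) & table.get(ch, 0)
        if state & accept:
            hits.append(j - len(pattern) + 1)
    return hits


GROUP_LEN = 2
patterns = ["abcde", "abfde"]

# Pattern strings that share a target substring become collision pattern strings.
by_target: Dict[str, List[str]] = defaultdict(list)
for p in patterns:
    by_target[truncate(p, GROUP_LEN)].append(p)

positions = shift_and("ababfdeabc", "ab")
print(dict(by_target), positions)
```

The positions found here are only candidates; each still has to be checked against the full pattern strings that share the target substring.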
In the embodiment of the present invention, it is assumed that a target substring corresponds to k pattern strings, where k is an integer greater than 1, and in a process of matching a text string to be matched using the target substring, if the text string hits the target substring, that is, if the target substring exists in the text string, collision of the k pattern strings occurs.
For example, in the above example, the target substring "ab" corresponds to two pattern strings (k = 2), "abcde" and "abfde". While the target substring "ab" is matched against the text string "ababfdeabc", reading the second character "b" of the text string shows that the text string hits the target substring "ab". Because "ab" corresponds to both "abcde" and "abfde", these two pattern strings collide and are called collision pattern strings. At this point it cannot be determined which of "abcde" and "abfde" matches the text string "ababfdeabc".
It can be seen that, in the case of using a target substring instead of a pattern string, because the target substring represents only a part of the content of the pattern string, the text string hits the target substring, and only represents that the text string has a possibility of hitting the pattern string corresponding to the target substring, all collision pattern strings generated by the target substring need to be traversed, and according to the position of the matching string in the text string, whether the text string hits each collision pattern string is further queried.
For example, the target substring "ab" generates the collision pattern strings "abcde" and "abfde", and assume that the query order is "abcde" first and "abfde" second. First, the length-5 character string "ababf" starting at position pos = 0 of the text string "ababfdeabc" is compared with the pattern string "abcde"; the result is a mismatch. Next, the length-5 character string "abfde" starting at pos = 2 is compared with "abcde"; the result is again a mismatch. Finally, a length-5 character string starting at pos = 7 would be compared with "abcde", but only 3 characters remain from pos = 7 to the end of the text string, so there is no match.
After the query for the pattern string "abcde" is complete, it is known that "abcde" does not occur in the text string "ababfdeabc", that is, the text string does not hit the pattern string "abcde". The same query process is then used to check whether the text string hits the pattern string "abfde". The result is that the text string "ababfdeabc" hits the pattern string "abfde" at position pos = 2.
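The verification pass over the candidate positions can be sketched as follows. The function name `verify` is hypothetical, and the sketch assumes the target substring is a prefix of each collision pattern string, so a candidate position is also where the full pattern string would start.

```python
from typing import Dict, List


def verify(text: str, positions: List[int], collisions: List[str]) -> Dict[str, List[int]]:
    """For each collision pattern string, keep the positions where it really occurs."""
    hits: Dict[str, List[int]] = {}
    for pattern in collisions:
        # startswith(pattern, pos) is False when fewer than len(pattern)
        # characters remain, which covers the pos = 7 case.
        hits[pattern] = [pos for pos in positions if text.startswith(pattern, pos)]
    return hits


result = verify("ababfdeabc", [0, 2, 7], ["abcde", "abfde"])
print(result)
```

Every collision pattern string is compared at every candidate position, which is the cost the later word-frequency ordering tries to reduce.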
It can be seen that, in the above example, the first two characters "ab" of the pattern strings "abcde" and "abfde" are truncated as target substrings instead of the pattern strings, resulting in the collision pattern strings "abcde" and "abfde" being generated in the matching process. In practical applications, the greater the number of pattern strings, the greater the number of collision pattern strings may be, and subsequently, each collision pattern string needs to be further queried and compared to determine whether a text string really hits the collision pattern string, which results in low matching efficiency.
In an optional embodiment of the present invention, step S32, according to the position of the matching string in the text string, queries whether the text string hits the pattern string corresponding to the target sub-string, which may specifically include:
step S41, if the target substring has collision pattern strings, calculating the word frequency of each collision pattern string;
and step S42, querying in turn, in descending order of the word frequency of the collision pattern strings, whether the text string hits each collision pattern string according to the position of the matching string in the text string.
In the above example, the pattern strings are "abcde" and "abfde". If the first two characters "ab" of both are taken as the target substring in place of the pattern strings, the target substring "ab" has the collision pattern strings "abcde" and "abfde" when it is matched against the text string "ababfdeabc".
It can be seen from the above example that, in the case of a collision pattern string existing in a target substring, the order of querying the collision pattern string in the text string will affect the efficiency of the query. In order to improve the query efficiency, the embodiment of the invention calculates the word frequency of each pattern string in the collision pattern string.
Specifically, the word frequency of each pattern string in the collision pattern string may be calculated based on a given corpus, for example, the word frequencies of the pattern strings "abcde" and "abfde" are calculated separately, and if f (abcde) is 10 and f (abfde) is 20, the pattern strings "abcde" and "abfde" are hung into a linked list or sequentially stored according to the word frequencies from large to small, and the query is performed according to the word frequency order, that is, the pattern string "abfde" is queried first, and then the pattern string "abcde" is queried.
The word frequency of the collision mode string can be obtained according to corpus statistics, which shows that under normal conditions, the probability of the mode string 'abfde' is higher than that of the mode string 'abcde', so that the mode string with higher word frequency is searched first, the hit probability is higher, the frequency of subsequent queries can be reduced, and the efficiency of mode matching is improved.
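The effect of the query order can be illustrated with a short sketch. The function name `query_by_frequency` is hypothetical, the frequencies are the example values from the text, and stopping at the first hit is an assumption made here to show how the ordering saves comparisons.

```python
from typing import Dict, List, Optional, Tuple


def query_by_frequency(
    text: str, pos: int, collisions: List[str], freq: Dict[str, int]
) -> Tuple[Optional[str], int]:
    """Try collision pattern strings from most to least frequent.

    Returns (first pattern string that occurs at pos, number of comparisons made).
    """
    ordered = sorted(collisions, key=lambda p: freq.get(p, 0), reverse=True)
    for tries, pattern in enumerate(ordered, start=1):
        if text.startswith(pattern, pos):
            return pattern, tries
    return None, len(ordered)


freq = {"abcde": 10, "abfde": 20}  # example word frequencies from the text
hit = query_by_frequency("ababfdeabc", 2, ["abcde", "abfde"], freq)
print(hit)
```

With these frequencies the hit at pos = 2 is found on the first comparison, whereas querying "abcde" first would need two.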
In an optional embodiment of the present invention, step S22 of determining a target substring among the substrings may specifically include: determining the substring with the minimum word frequency among the substrings as the target substring.
In order to reduce the probability of the target substring generating the collision mode string as much as possible, the embodiment of the invention determines the target substring according to the word frequency. Specifically, when the length h of the pattern string exceeds the group length M of the B table, the pattern string is sequentially divided according to the group length M of the B table to obtain each substring of the pattern string, the word frequency of each substring obtained after division is respectively calculated, and the substring with the minimum word frequency is selected as a target substring.
For example, for the pattern string "abcde" in the above example, the length h is 5, and the group length M of the B table allocated to the pattern string is assumed to be 2. Dividing the pattern strings "abcde" and "abfde" according to the group length M of the B table yields the substrings "ab", "bc", "cd", "de" and "ab", "bf", "fd", "de", respectively. The word frequency F of each substring is then calculated. Assuming F(ab) = 2, F(bc) = 1, F(cd) = 1, F(de) = 2, F(bf) = 1 and F(fd) = 1, the substrings "bc" and "cd" have the minimum word frequency within the pattern string "abcde", so either "bc" or "cd" can be used as the target substring of the pattern string "abcde". Similarly, for the pattern string "abfde", the substrings "bf" and "fd" have the minimum word frequency, so either "bf" or "fd" can be used as the target substring of the pattern string "abfde".
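The selection rule can be sketched in Python as follows. As an assumption made only for this illustration, the word frequency of a substring is counted over the set of pattern strings itself, which happens to reproduce the F values of the example; the embodiment leaves the way of calculating word frequency open. The names `split_substrings` and `choose_targets` are hypothetical.

```python
from collections import Counter


def split_substrings(pattern, group_len):
    """All substrings of length `group_len`; the pattern itself if it already fits."""
    if len(pattern) <= group_len:
        return [pattern]
    return [pattern[i:i + group_len] for i in range(len(pattern) - group_len + 1)]


def choose_targets(patterns, group_len):
    """Pick, for each pattern, the substring with the minimum word frequency."""
    freq = Counter()
    for p in patterns:
        freq.update(split_substrings(p, group_len))  # assumption: count over the pattern set
    targets = {
        p: min(split_substrings(p, group_len), key=lambda s: freq[s])
        for p in patterns
    }
    return targets, freq


targets, freq = choose_targets(["abcde", "abfde"], group_len=2)
print(dict(freq))
print(targets)
```

When several substrings share the minimum word frequency, `min` keeps the first one, so this sketch selects "bc" and "bf"; choosing "cd" or "fd" would satisfy the rule equally well.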
In one example, assume that the substring "bc" is the target substring of the pattern string "abcde" and the substring "bf" is the target substring of the pattern string "abfde"; the allocated B table is shown in Table 3.
TABLE 3
Character   Bit vector
a           00
b           01
c           10
d           00
e           00
f           10
g           00
In the process of matching the text string according to Table 3, whether the text string hits the target substring "bc" or the target substring "bf", the pattern strings "abcde" and "abfde" do not collide. It can be seen that selecting the target substring by the word frequency of the substrings reduces the probability of producing collision pattern strings when the target substring is hit, and thus improves the efficiency of subsequent queries.
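The following Python sketch shows one way the B table of Table 3 could be built and used with the shift-and algorithm, followed by the query step that checks the full pattern string at the matched position. The function names, the sample text, and the `target_index` lookup (target substring to pattern string and offset) are assumptions introduced for illustration only.

```python
def build_b_table(targets):
    """B[c] has bit k set when some target substring has character c at position k."""
    table = {}
    for t in targets:
        for k, ch in enumerate(t):
            table[ch] = table.get(ch, 0) | (1 << k)
    return table


def shift_and(text, table, length):
    """Return the end positions in `text` where a string accepted by the table ends."""
    final = 1 << (length - 1)
    state = 0
    ends = []
    for pos, ch in enumerate(text):
        state = ((state << 1) | 1) & table.get(ch, 0)
        if state & final:
            ends.append(pos)
    return ends


# Target substrings from the example; characters absent from the table map to 00.
b_table = build_b_table(["bc", "bf"])

# Hypothetical lookup: target substring -> (pattern string, offset of target in pattern).
target_index = {"bc": ("abcde", 1), "bf": ("abfde", 1)}

text = "ababfdeabcde"  # hypothetical text string
hits = []
for end in shift_and(text, b_table, length=2):
    matched = text[end - 1:end + 1]
    entry = target_index.get(matched)
    if entry is None:
        continue
    pattern, offset = entry
    start = end - 1 - offset
    if start >= 0 and text.startswith(pattern, start):
        hits.append((start, pattern))
print(b_table, hits)
```

Because each matched target substring maps to exactly one pattern string here, the query step needs only one comparison per hit.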
It can be understood that the embodiment of the present invention does not limit the specific way of calculating the word frequency of the substring. For example, the calculation may be performed based on a document in an application environment where the current text string is located, or may be performed based on big data.
In an optional embodiment of the present invention, in step 103, if the number n of the pattern strings to be matched is greater than 1, the allocating the bit vector table corresponding to the encoded pattern string may specifically include:
step S51, determining the grouping number m of the bit vector table;
step S52, the n encoded pattern strings are allocated to a bit vector table containing m packets.
In multi-pattern matching (the number n of pattern strings to be matched is greater than 1), the bit vector table must be divided into packets, and the sum of the bit lengths of all packets must not exceed the machine word length w (64 bits). For a bit vector table with m packets, the n encoded pattern strings are first allocated to the m packets, and then the bit operations of the shift-and algorithm are executed on the multiple pattern strings simultaneously, across the packets of one machine word. When n encoded pattern strings are allocated to a bit vector table containing m packets, there are two cases: if n is less than or equal to m, each pattern string can be allocated its own packet; if n is greater than m, multiple pattern strings may be allocated to the same packet.
For example, if 5 pattern strings are allocated to one packet, then when a character of the text string is read, it may hit several of the 5 pattern strings at the same time, causing the pattern strings to collide. Which pattern string is actually hit must then be determined by further comparison, which increases the cost of comparison operations and reduces matching efficiency.
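For the case where n is less than or equal to m, the packed shift-and operation can be sketched in Python as below. This is a minimal illustration under stated assumptions: every pattern string gets its own packet of `group_len` bits, every pattern is no longer than `group_len`, and the names `build_grouped_table` and `search` are hypothetical. Python integers are unbounded, so the 64-bit machine word is emulated with an explicit mask.

```python
WORD_BITS = 64  # machine word length w


def build_grouped_table(patterns, group_len):
    """Place pattern j in bits [j*group_len, (j+1)*group_len) of one machine word."""
    assert len(patterns) * group_len <= WORD_BITS, "packets must fit in one word"
    table, init, final = {}, 0, 0
    for j, p in enumerate(patterns):
        assert 0 < len(p) <= group_len
        base = j * group_len
        init |= 1 << base                    # lowest bit of each packet
        final |= 1 << (base + len(p) - 1)    # accepting bit of each packet
        for k, ch in enumerate(p):
            table[ch] = table.get(ch, 0) | (1 << (base + k))
    return table, init, final


def search(text, patterns, group_len):
    """Run shift-and on all packets at once; return (start, pattern) for each match."""
    table, init, final = build_grouped_table(patterns, group_len)
    mask = (1 << WORD_BITS) - 1
    state, hits = 0, []
    for pos, ch in enumerate(text):
        # One shift, one OR and one AND advance every packet simultaneously.
        state = ((state << 1) | init) & table.get(ch, 0) & mask
        matched = state & final
        if matched:
            for j, p in enumerate(patterns):
                if matched & (1 << (j * group_len + len(p) - 1)):
                    hits.append((pos - len(p) + 1, p))
    return hits


result = search("abcdebf", ["bc", "bf", "de"], group_len=2)
print(result)
```

A bit shifted out of the top of one packet lands on the lowest bit of the next packet, which the `init` mask sets anyway, so the packets do not interfere with each other.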
In an optional embodiment of the present invention, the allocating the n encoded pattern strings to a bit vector table including m packets may specifically include:
step S61, determining the set of misjudged character strings produced by allocating the i-th encoded pattern string to the j-th packet, where i ranges from 1 to n and j ranges from 1 to m;
step S62, determining, according to the word frequency of each misjudged character string in the set of misjudged character strings, a first loss gain produced by allocating the i-th encoded pattern string to the j-th packet;
step S63, allocating the i-th encoded pattern string to the packet with the smallest first loss gain, until the n-th encoded pattern string has been allocated.
To reduce collisions between pattern strings as much as possible and improve the efficiency of pattern matching, in multi-pattern matching, given the number of packets and the packet length, the embodiment of the invention allocates the n pattern strings to the m packets by minimum loss gain, following the greedy principle.
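Steps S61 to S63 can be sketched in Python under explicit assumptions. This passage does not define the set of misjudged character strings or the loss formula, so the sketch assumes that (a) all pattern strings in a packet have equal length, (b) the misjudged strings of a packet are the strings accepted by the packet's per-position character classes that are not themselves pattern strings, and (c) the first loss gain is the increase in the summed word frequency of those misjudged strings. All names and the word frequencies are hypothetical.

```python
from itertools import product


def misjudged_strings(group):
    """Strings accepted by the packet's per-position character classes
    that are not actual pattern strings (assumed definition)."""
    if not group:
        return set()
    classes = [set(chars) for chars in zip(*group)]
    accepted = {"".join(chars) for chars in product(*classes)}
    return accepted - set(group)


def loss(group, word_freq):
    """Summed word frequency of the packet's misjudged strings."""
    return sum(word_freq.get(s, 0) for s in misjudged_strings(group))


def greedy_allocate(patterns, m, word_freq):
    """Allocate each pattern to the packet with the smallest loss gain."""
    groups = [[] for _ in range(m)]
    for p in patterns:
        gains = [loss(g + [p], word_freq) - loss(g, word_freq) for g in groups]
        groups[gains.index(min(gains))].append(p)
    return groups


# Hypothetical word frequencies of strings that could be misjudged.
word_freq = {"ad": 5, "cb": 6, "af": 3, "eb": 4, "cf": 1, "ed": 2}
groups = greedy_allocate(["ab", "cd", "ef"], m=2, word_freq=word_freq)
print(groups)
```

With these assumed frequencies, "ef" joins the packet of "cd" because the misjudged strings "cf" and "ed" are rarer than "af" and "eb".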
In summary, before pattern matching, the embodiment of the present invention performs word segmentation on a pattern string to obtain a first word segmentation set corresponding to the pattern string, and encodes each word segmentation in the first word segmentation set to obtain an encoded pattern string; allocating a bit vector table corresponding to the encoded pattern string, wherein bits in the bit vector table are used for indicating whether corresponding participles appear in the pattern string; and matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table.
Firstly, the embodiment of the invention carries out word segmentation on the pattern string and then encodes the pattern string, which not only can correct semantic errors and filter meaningless matching results, but also can compress the value range length of the B table by the encoding after word segmentation, reduce the space occupied by the cache and improve the utilization rate of the B table.
In addition, the embodiment of the invention selects the target substring by the word frequency of the substrings, which reduces the probability of producing collision pattern strings when the target substring is hit, and thus improves the efficiency of subsequent queries.
Furthermore, when the target substring has collision pattern strings, the query order of the collision pattern strings is determined by the word frequency of the pattern strings, so that the number of queries is reduced as much as possible and query efficiency is improved.
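The overall flow summarized above (word segmentation, encoding, bit vector table allocation, bit-parallel matching) can be sketched in Python. Whitespace splitting stands in for a real word segmenter, and the integer codes, the code -1 for words outside the pattern string, and all function names are assumptions made for this illustration.

```python
def segment(s):
    """Stand-in word segmenter (assumption): split on whitespace."""
    return s.split()


def build_vocab(words):
    """Assign each distinct participle a small integer code."""
    vocab = {}
    for w in words:
        vocab.setdefault(w, len(vocab))
    return vocab


def encode(words, vocab):
    """Encode participles; words absent from the pattern string become -1."""
    return [vocab.get(w, -1) for w in words]


def match_words(pattern, text):
    """Shift-and over encoded participles; return word-level start positions."""
    p_codes = encode(segment(pattern), build_vocab(segment(pattern)))
    t_codes = encode(segment(text), build_vocab(segment(pattern)))
    # Bit k of table[code] is set when the participle occurs at position k.
    table = {}
    for k, code in enumerate(p_codes):
        table[code] = table.get(code, 0) | (1 << k)
    final = 1 << (len(p_codes) - 1)
    state, hits = 0, []
    for pos, code in enumerate(t_codes):
        state = ((state << 1) | 1) & table.get(code, 0)
        if state & final:
            hits.append(pos - len(p_codes) + 1)
    return hits


print(match_words("new york", "i love new york city"))
print(match_words("cat", "concatenate strings"))
```

The second call returns no match even though "cat" occurs inside "concatenate" at the character level, which illustrates how matching on participles can filter meaningless results; the table also needs only one entry per distinct participle of the pattern string.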
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Device embodiment
Referring to fig. 2, a block diagram of a structure of an embodiment of a pattern matching apparatus of the present invention is shown, where the apparatus may specifically include:
the first word segmentation module 201 is configured to perform word segmentation on a pattern string to be matched to obtain a first word segmentation set corresponding to the pattern string;
a first encoding module 202, configured to encode each participle in the first participle set to obtain an encoded pattern string;
an allocating module 203, configured to allocate a bit vector table corresponding to the encoded pattern string, where a bit in the bit vector table is used to indicate whether a corresponding participle appears in the pattern string;
and the pattern matching module 204 is configured to match the pattern string with a text string to be matched based on a bit parallel algorithm according to the bit vector table.
Optionally, the apparatus further comprises:
the second word segmentation module is used for segmenting words of the text string to obtain a second word segmentation set corresponding to the text string;
the second coding module is used for coding each participle in the second participle set to obtain a coded text string;
the pattern matching module is specifically configured to match the encoded pattern string with the encoded text string based on a bit parallel algorithm according to the bit vector table.
Optionally, the apparatus further comprises:
the substring dividing module is used for dividing the coded pattern string according to the grouping length when the length of the coded pattern string is greater than the preset grouping length to obtain each divided substring;
the target determining module is used for determining a target substring in each substring;
the allocation module is specifically configured to allocate a bit vector table corresponding to the target substring;
the pattern matching module comprises:
the matching sub-module is used for matching the target substring with the text string based on a bit parallel algorithm according to a bit vector table corresponding to the target substring to obtain the position of the matching string in the text string;
and the query submodule is used for querying whether the text string hits the pattern string corresponding to the target substring according to the position of the matching string in the text string.
Optionally, the target determining module is specifically configured to determine, in each of the substrings, the substring with the smallest word frequency as the target substring.
Optionally, the query submodule includes:
the word frequency calculation unit is used for calculating the word frequency of each collision pattern string if the target substring has collision pattern strings;
and the query comparison unit is used for sequentially querying, according to the position of the matching string in the text string, whether the text string hits each collision pattern string, in descending order of the word frequencies of the collision pattern strings.
Optionally, if the number n of the pattern strings to be matched is greater than 1, the allocating module includes:
the grouping determination submodule is used for determining the grouping number m of the bit vector table;
a pattern string allocating submodule for allocating the n encoded pattern strings to a bit vector table containing m packets.
Optionally, the pattern string allocating submodule includes:
the first calculation unit is used for determining the set of misjudged character strings produced by allocating the i-th encoded pattern string to the j-th packet, where i ranges from 1 to n and j ranges from 1 to m;
a second calculating unit, configured to determine, according to a word frequency of each misjudged character string in the misjudged character string set, a first loss gain generated by allocating the ith encoded pattern string to the jth packet;
a pattern string allocating unit, configured to allocate the ith encoded pattern string to a packet with a minimum first loss gain until the nth encoded pattern string is allocated.
Since the device embodiment is basically similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the corresponding description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
An embodiment of the present invention provides an apparatus for pattern matching, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for: performing word segmentation on the pattern string to obtain a first word segmentation set corresponding to the pattern string; encoding each participle in the first participle set to obtain an encoded pattern string; allocating a bit vector table corresponding to the encoded pattern string, wherein bits in the bit vector table are used for indicating whether corresponding participles appear in the pattern string; and matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table.
Fig. 3 is a block diagram illustrating an apparatus 800 for pattern matching according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 3, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice information processing mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the apparatus 800. The sensor assembly 814 may also detect a change in position of the apparatus 800 or of a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the apparatus 800 and other devices. The apparatus 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the methods described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 4 is a schematic diagram of a server in some embodiments of the invention. The server 1900, which may vary widely in configuration or performance, may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instructions operating on a server. Still further, a central processor 1922 may be provided in communication with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server, Mac OS X™, Unix™, Linux™, FreeBSD™, and so forth.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform the pattern matching method shown in fig. 1.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform a pattern matching method, the method comprising: performing word segmentation on the pattern string to obtain a first word segmentation set corresponding to the pattern string; encoding each participle in the first participle set to obtain an encoded pattern string; allocating a bit vector table corresponding to the encoded pattern string, wherein bits in the bit vector table are used for indicating whether corresponding participles appear in the pattern string; and matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table.
The embodiment of the invention discloses A1, a pattern matching method, comprising:
performing word segmentation on a pattern string to be matched to obtain a first word segmentation set corresponding to the pattern string;
encoding each participle in the first participle set to obtain an encoded pattern string;
allocating a bit vector table corresponding to the encoded pattern string, wherein bits in the bit vector table are used for indicating whether corresponding participles appear in the pattern string;
and matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table.
A2, the method of A1, wherein before the matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table, the method further comprises:
performing word segmentation on the text string to obtain a second word segmentation set corresponding to the text string;
encoding each participle in the second participle set to obtain an encoded text string;
the matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table includes:
matching the encoded pattern string with the encoded text string based on a bit parallel algorithm according to the bit vector table.
A3, the method of A1, wherein before the allocating the bit vector table corresponding to the encoded pattern string, the method further comprises:
when the length of the encoded pattern string is greater than a preset grouping length, dividing the encoded pattern string according to the grouping length to obtain the divided substrings;
determining a target substring among the substrings;
the allocating a bit vector table corresponding to the encoded pattern string includes:
allocating a bit vector table corresponding to the target substring;
the matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table includes:
matching the target substring with the text string based on a bit parallel algorithm according to the bit vector table corresponding to the target substring to obtain the position of the matching string in the text string;
and querying whether the text string hits the pattern string corresponding to the target substring according to the position of the matching string in the text string.
A4, the method of A3, wherein the determining a target substring among the substrings includes:
determining the substring with the minimum word frequency among the substrings as the target substring.
A5, the method of A3, wherein the querying whether the text string hits the pattern string corresponding to the target substring according to the position of the matching string in the text string includes:
if the target substring has collision pattern strings, calculating the word frequency of each collision pattern string;
and sequentially querying, according to the position of the matching string in the text string, whether the text string hits each collision pattern string, in descending order of the word frequencies of the collision pattern strings.
A6, the method of A1, wherein if the number n of the pattern strings to be matched is greater than 1, the allocating a bit vector table corresponding to the encoded pattern strings includes:
determining the grouping number m of a bit vector table;
the n encoded pattern strings are allocated into a bit vector table containing m packets.
A7, the method of A6, wherein the allocating the n encoded pattern strings into a bit vector table containing m packets includes:
determining the set of misjudged character strings produced by allocating the i-th encoded pattern string to the j-th packet, where i ranges from 1 to n and j ranges from 1 to m;
determining, according to the word frequency of each misjudged character string in the set of misjudged character strings, a first loss gain produced by allocating the i-th encoded pattern string to the j-th packet;
and allocating the i-th encoded pattern string to the packet with the minimum first loss gain, until the n-th encoded pattern string has been allocated.
The embodiment of the invention discloses B8, a pattern matching device, wherein the device comprises:
the first word segmentation module is used for segmenting words of the pattern string to be matched to obtain a first word segmentation set corresponding to the pattern string;
the first coding module is used for coding each participle in the first participle set to obtain an encoded pattern string;
the allocation module is used for allocating a bit vector table corresponding to the encoded pattern string, and bits in the bit vector table are used for indicating whether corresponding participles appear in the pattern string;
and the pattern matching module is used for matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table.
B9, the apparatus of B8, the apparatus further comprising:
the second word segmentation module is used for segmenting words of the text string to obtain a second word segmentation set corresponding to the text string;
the second coding module is used for coding each participle in the second participle set to obtain a coded text string;
the pattern matching module is specifically configured to match the encoded pattern string with the encoded text string based on a bit parallel algorithm according to the bit vector table.
B10, the apparatus of B8, the apparatus further comprising:
the substring dividing module is used for dividing the coded pattern string according to the grouping length when the length of the coded pattern string is greater than the preset grouping length to obtain each divided substring;
the target determining module is used for determining a target substring in each substring;
the allocation module is specifically configured to allocate a bit vector table corresponding to the target substring;
the pattern matching module comprises:
the matching sub-module is used for matching the target substring with the text string based on a bit parallel algorithm according to a bit vector table corresponding to the target substring to obtain the position of the matching string in the text string;
and the query submodule is used for querying whether the text string hits the pattern string corresponding to the target substring according to the position of the matching string in the text string.
B11, the apparatus of B10, wherein the target determining module is specifically configured to determine, among the substrings, the substring with the smallest word frequency as the target substring.
B12, the apparatus according to B10, the query submodule comprising:
the word frequency calculation unit is used for calculating the word frequency of each collision pattern string if the target substring has collision pattern strings;
and the query comparison unit is used for sequentially querying, according to the position of the matching string in the text string, whether the text string hits each collision pattern string, in descending order of the word frequencies of the collision pattern strings.
B13, according to the apparatus of B8, if the number n of the pattern strings to be matched is greater than 1, the allocating module includes:
the grouping determination submodule is used for determining the grouping number m of the bit vector table;
a pattern string allocating submodule for allocating the n encoded pattern strings to a bit vector table containing m packets.
B14, the apparatus of B13, the pattern string allocating submodule comprising:
the first calculation unit is used for determining the set of misjudged character strings produced by allocating the i-th encoded pattern string to the j-th packet, where i ranges from 1 to n and j ranges from 1 to m;
a second calculating unit, configured to determine, according to a word frequency of each misjudged character string in the misjudged character string set, a first loss gain generated by allocating the ith encoded pattern string to the jth packet;
a pattern string allocating unit, configured to allocate the ith encoded pattern string to a packet with a minimum first loss gain until the nth encoded pattern string is allocated.
The embodiment of the invention discloses C15, an apparatus for data processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
performing word segmentation on a pattern string to be matched to obtain a first word segmentation set corresponding to the pattern string;
encoding each participle in the first participle set to obtain an encoded pattern string;
allocating a bit vector table corresponding to the encoded pattern string, wherein bits in the bit vector table are used for indicating whether corresponding participles appear in the pattern string;
and matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table.
C16, the device of C15, wherein the one or more programs configured to be executed by the one or more processors further include instructions for:
performing word segmentation on the text string to obtain a second word segmentation set corresponding to the text string;
encoding each participle in the second participle set to obtain an encoded text string;
the matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table includes:
matching the encoded pattern string with the encoded text string based on a bit parallel algorithm according to the bit vector table.
C17, the device of C15, wherein the one or more programs configured to be executed by the one or more processors further include instructions for:
when the length of the encoded pattern string is greater than a preset grouping length, dividing the encoded pattern string according to the grouping length to obtain the divided substrings;
determining a target substring among the substrings;
the allocating a bit vector table corresponding to the encoded pattern string includes:
allocating a bit vector table corresponding to the target substring;
the matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table includes:
matching the target substring with the text string based on a bit parallel algorithm according to the bit vector table corresponding to the target substring to obtain the position of the matching string in the text string;
and querying whether the text string hits the pattern string corresponding to the target substring according to the position of the matching string in the text string.
C18, the apparatus of C17, wherein the determining of a target substring among the substrings comprises:
determining, among the substrings, the substring with the lowest word frequency as the target substring.
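The handling of an over-long pattern in C17 and C18 can be sketched as follows, purely as an illustration. The encoded pattern is cut into substrings of at most the grouping length, the substring with the lowest word frequency is taken as the target, only the target is scanned bit-parallel, and each hit is then checked against the full pattern. The `substring_freq` lookup table, the treatment of unknown substrings as frequency 0, and every function name are hypothetical; the Shift-And loop is one possible bit-parallel algorithm, not necessarily the one intended.

```python
def split_by_grouping_length(encoded_pattern, grouping_length):
    # Return (offset, substring) pairs; the last substring may be shorter.
    return [
        (i, tuple(encoded_pattern[i:i + grouping_length]))
        for i in range(0, len(encoded_pattern), grouping_length)
    ]


def pick_target(substrings, substring_freq):
    # Choose the substring with the lowest word frequency (assumed lookup table).
    return min(substrings, key=lambda item: substring_freq.get(item[1], 0))


def find_substring(target, encoded_text):
    # Shift-And scan for the target substring; returns its start positions.
    table = {}
    for k, code in enumerate(target):
        table[code] = table.get(code, 0) | (1 << k)
    accept = 1 << (len(target) - 1)
    state = 0
    hits = []
    for pos, code in enumerate(encoded_text):
        state = ((state << 1) | 1) & table.get(code, 0)
        if state & accept:
            hits.append(pos - len(target) + 1)
    return hits


def find_long_pattern(encoded_pattern, encoded_text, grouping_length, substring_freq):
    if not encoded_pattern:
        return []
    if len(encoded_pattern) <= grouping_length:
        return find_substring(tuple(encoded_pattern), encoded_text)
    substrings = split_by_grouping_length(encoded_pattern, grouping_length)
    offset, target = pick_target(substrings, substring_freq)
    starts = []
    for hit in find_substring(target, encoded_text):
        # A hit on the target only suggests a candidate; verify the whole pattern.
        start = hit - offset
        window = list(encoded_text[start:start + len(encoded_pattern)])
        if start >= 0 and window == list(encoded_pattern):
            starts.append(start)
    return starts
```

Choosing the rarest substring as the target keeps the number of candidate positions, and therefore the number of full-pattern verifications, small.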
C19, the apparatus of C17, wherein the querying, according to the position of the matching string in the text string, of whether the text string hits the pattern string corresponding to the target substring comprises:
if the target substring has collision pattern strings, calculating the word frequency of each collision pattern string;
and querying in turn, according to the position of the matching string in the text string and in descending order of the word frequency of the collision pattern strings, whether the text string hits each collision pattern string.
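The ordered query of C19 can be illustrated with the following Python sketch. Several pattern strings that share the same target substring (the collision pattern strings) are checked at a hit position in descending order of word frequency. The `pattern_freq` table, the `(pattern, offset)` candidate layout, and the function name are assumptions for illustration. Whether the query stops at the first hit is not stated above, so the sketch is written as a generator and leaves that choice to the caller.

```python
def query_collisions(encoded_text, hit_pos, candidates, pattern_freq):
    """Yield (pattern, start) for each collision pattern string hit at hit_pos.

    candidates: list of (pattern, offset) pairs, where pattern is a tuple of
    codes and offset is the position of the shared target substring inside it.
    pattern_freq: hypothetical word frequency table keyed by pattern.
    """
    ordered = sorted(
        candidates,
        key=lambda item: pattern_freq.get(item[0], 0),
        reverse=True,  # most frequent collision pattern string first
    )
    for pattern, offset in ordered:
        start = hit_pos - offset
        if start < 0:
            continue
        if tuple(encoded_text[start:start + len(pattern)]) == pattern:
            yield pattern, start
```

A caller interested only in the most likely pattern can take `next(query_collisions(...), None)`, so that frequent patterns are confirmed after few comparisons.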
C20, the apparatus of C15, wherein, if the number n of pattern strings to be matched is greater than 1, the allocating of a bit vector table corresponding to the encoded pattern string comprises:
determining the number of groups m of the bit vector table;
allocating the n encoded pattern strings into the bit vector table containing m groups.
C21, the apparatus of C20, wherein the allocating of the n encoded pattern strings into the bit vector table containing m groups comprises:
determining the set of misjudged character strings generated by allocating the ith encoded pattern string to the jth group, wherein i ranges from 1 to n and j ranges from 1 to m;
determining, according to the word frequency of each misjudged character string in the misjudged character string set, a first loss gain generated by allocating the ith encoded pattern string to the jth group;
and allocating the ith encoded pattern string to the group with the minimum first loss gain, until the nth encoded pattern string has been allocated.
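The greedy allocation of C20 and C21 may be sketched as below, under assumptions that go beyond what is stated here and are labeled as such. The sketch assumes that the pattern strings in one group have equal length and are superimposed position by position, so that the group accepts every combination of the codes seen at each position; the combinations that are not real pattern strings are treated as the misjudged character strings. The first loss gain is taken to be the increase in the summed word frequency of the misjudged strings. The `freq` table and all function names are hypothetical.

```python
from itertools import product


def misjudged(group):
    # Assumption: a group accepts the Cartesian product of its per-position
    # code sets; anything accepted that is not a real pattern is misjudged.
    if not group:
        return set()
    length = len(group[0])  # assumption: equal-length patterns within a group
    position_sets = [{p[k] for p in group} for k in range(length)]
    return set(product(*position_sets)) - set(group)


def loss(group, freq):
    # Summed word frequency of the misjudged strings of a group.
    return sum(freq.get(s, 0) for s in misjudged(group))


def allocate(patterns, m, freq):
    # Place each encoded pattern into the group with the smallest loss gain.
    groups = [[] for _ in range(m)]
    for pattern in patterns:
        gains = [loss(g + [pattern], freq) - loss(g, freq) for g in groups]
        j = gains.index(min(gains))  # ties go to the lowest group index
        groups[j].append(pattern)
    return groups
```

Weighting misjudged strings by word frequency means that a group is penalized only for false candidates that are likely to occur in real text, which is the apparent motivation for using word frequency in the loss gain.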
An embodiment of the present invention discloses D22, a machine-readable medium having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform the pattern matching method described in one or more of A1 to A7.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
The pattern matching method, the pattern matching apparatus, and the device for pattern matching provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. Meanwhile, a person skilled in the art may, following the idea of the invention, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims (10)

1. A pattern matching method, the method comprising:
performing word segmentation on a pattern string to be matched to obtain a first word segment set corresponding to the pattern string;
encoding each word segment in the first word segment set to obtain an encoded pattern string;
allocating a bit vector table corresponding to the encoded pattern string, wherein bits in the bit vector table indicate whether the corresponding word segments appear in the pattern string;
and matching the pattern string against a text string to be matched, based on a bit-parallel algorithm, according to the bit vector table.
2. The method of claim 1, wherein before the matching of the pattern string against the text string to be matched, based on a bit-parallel algorithm, according to the bit vector table, the method further comprises:
performing word segmentation on the text string to obtain a second word segment set corresponding to the text string;
encoding each word segment in the second word segment set to obtain an encoded text string;
and wherein the matching of the pattern string against the text string to be matched, based on a bit-parallel algorithm, according to the bit vector table comprises:
matching the encoded pattern string against the encoded text string, based on the bit-parallel algorithm, according to the bit vector table.
3. The method of claim 1, wherein before the allocating of the bit vector table corresponding to the encoded pattern string, the method further comprises:
when the length of the encoded pattern string is greater than a preset grouping length, dividing the encoded pattern string according to the grouping length to obtain the divided substrings;
determining a target substring among the substrings;
wherein the allocating of a bit vector table corresponding to the encoded pattern string comprises:
allocating a bit vector table corresponding to the target substring;
and the matching of the pattern string against the text string to be matched, based on a bit-parallel algorithm, according to the bit vector table comprises:
matching the target substring against the text string, based on the bit-parallel algorithm, according to the bit vector table corresponding to the target substring, to obtain the position of the matching string in the text string;
and querying, according to the position of the matching string in the text string, whether the text string hits the pattern string corresponding to the target substring.
4. The method of claim 3, wherein the determining of a target substring among the substrings comprises:
determining, among the substrings, the substring with the lowest word frequency as the target substring.
5. The method according to claim 3, wherein the querying, according to the position of the matching string in the text string, of whether the text string hits the pattern string corresponding to the target substring comprises:
if the target substring has collision pattern strings, calculating the word frequency of each collision pattern string;
and querying in turn, according to the position of the matching string in the text string and in descending order of the word frequency of the collision pattern strings, whether the text string hits each collision pattern string.
6. The method according to claim 1, wherein, if the number n of pattern strings to be matched is greater than 1, the allocating of the bit vector table corresponding to the encoded pattern string comprises:
determining the number of groups m of the bit vector table;
allocating the n encoded pattern strings into the bit vector table containing m groups.
7. The method of claim 6, wherein the allocating of the n encoded pattern strings into the bit vector table containing m groups comprises:
determining the set of misjudged character strings generated by allocating the ith encoded pattern string to the jth group, wherein i ranges from 1 to n and j ranges from 1 to m;
determining, according to the word frequency of each misjudged character string in the misjudged character string set, a first loss gain generated by allocating the ith encoded pattern string to the jth group;
and allocating the ith encoded pattern string to the group with the minimum first loss gain, until the nth encoded pattern string has been allocated.
8. A pattern matching apparatus, characterized in that the apparatus comprises:
a first word segmentation module, configured to perform word segmentation on a pattern string to be matched to obtain a first word segment set corresponding to the pattern string;
a first encoding module, configured to encode each word segment in the first word segment set to obtain an encoded pattern string;
an allocation module, configured to allocate a bit vector table corresponding to the encoded pattern string, wherein bits in the bit vector table indicate whether the corresponding word segments appear in the pattern string;
and a pattern matching module, configured to match the pattern string against a text string to be matched, based on a bit-parallel algorithm, according to the bit vector table.
9. An apparatus for data processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs comprise instructions for:
performing word segmentation on a pattern string to be matched to obtain a first word segment set corresponding to the pattern string;
encoding each word segment in the first word segment set to obtain an encoded pattern string;
allocating a bit vector table corresponding to the encoded pattern string, wherein bits in the bit vector table indicate whether the corresponding word segments appear in the pattern string;
and matching the pattern string against a text string to be matched, based on a bit-parallel algorithm, according to the bit vector table.
10. A machine-readable medium having stored thereon instructions, which when executed by one or more processors, cause an apparatus to perform a pattern matching method as recited in one or more of claims 1-7.
CN202010183402.5A 2020-03-16 2020-03-16 Pattern matching method and device for pattern matching Active CN111400563B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010183402.5A CN111400563B (en) 2020-03-16 2020-03-16 Pattern matching method and device for pattern matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010183402.5A CN111400563B (en) 2020-03-16 2020-03-16 Pattern matching method and device for pattern matching

Publications (2)

Publication Number Publication Date
CN111400563A (en) 2020-07-10
CN111400563B (en) 2023-08-01

Family

ID=71428920

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010183402.5A Active CN111400563B (en) 2020-03-16 2020-03-16 Pattern matching method and device for pattern matching

Country Status (1)

Country Link
CN (1) CN111400563B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112131866A (en) * 2020-09-25 2020-12-25 马上消费金融股份有限公司 Word segmentation method, device, equipment and readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150358264A1 (en) * 2014-06-05 2015-12-10 Pegatron Corporation Information supply method and system, and word string supply system
CN110276071A (en) * 2019-05-24 2019-09-24 众安在线财产保险股份有限公司 A kind of text matching technique, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN111400563B (en) 2023-08-01

Similar Documents

Publication Publication Date Title
US20070094718A1 (en) Configurable dynamic input word prediction algorithm
CN106257452B (en) Modifying search results based on contextual characteristics
CN112149032B (en) Advertisement interception method and device
CN112861175B (en) Data processing method and device for data processing
CN114090575A (en) Data storage method and retrieval method based on key value database and corresponding devices
CN115145735B (en) Memory allocation method and device and readable storage medium
CN111400563B (en) Pattern matching method and device for pattern matching
CN114428797A (en) Method, device and equipment for caching embedded parameters and storage medium
CN113761565B (en) Data desensitization method and device
CN116644144A (en) Extension number storage method, privacy number binding method and related devices
CN110147426B (en) Method for determining classification label of query text and related device
CN110377654B (en) Data request processing method and device, electronic equipment and computer-readable storage medium
CN113988313A (en) User data deleting method and device and electronic equipment
CN110110292B (en) Data processing method and device for data processing
CN108073566B (en) Word segmentation method and device and word segmentation device
CN112651221A (en) Data processing method and device and data processing device
CN112818710A (en) Method and device for processing asynchronous network machine translation request
CN110019657B (en) Processing method, apparatus and machine-readable medium
CN111708715A (en) Memory allocation method, memory allocation device and terminal equipment
CN111680014A (en) Shared file acquisition method and device, electronic equipment and storage medium
CN111382325B (en) Pattern string allocation method and device for pattern string allocation
CN111460836B (en) Data processing method and device for data processing
CN112987941A (en) Method and device for generating candidate words
CN117909258B (en) Optimization method and device for processor cache, electronic equipment and storage medium
CN113032808B (en) Data processing method and device, readable medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant