Detailed Description of Embodiments
To make the objects, features, and advantages of the present invention clearer and easier to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings of those embodiments. The described embodiments are evidently only some, and not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the present invention, without creative effort, shall fall within the protection scope of the present invention.
In this specification, descriptions that refer to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" mean that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict with one another, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or as implicitly indicating the number of the technical features referred to. Accordingly, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality of" means two or more, unless otherwise expressly and specifically defined.
The implementations described in the following exemplary embodiments do not represent all implementations consistent with this specification. Rather, they are merely examples of methods, apparatuses, or devices consistent with some aspects of this specification as detailed in the appended claims.
As noted above, in the related art the efficiency of string matching can be improved by using a preset matching algorithm, for example the Aho-Corasick automaton algorithm (AC automaton algorithm), the Boyer-Moore algorithm, or other algorithms.
Referring to Fig. 1, in one aspect an embodiment of the present invention provides a string matching method, the method comprising:
Step 101: obtaining a target string dictionary;
Step 102: generating index information from the target string dictionary, the index information including the conditional attributes of the strings in the target string dictionary;
Step 103: constructing a multi-pattern matching dictionary tree (trie) according to the index information.
The string matching method provided in the embodiments of the present invention can be applied to a client or a server. The client may be, for example, a desktop computer, a mobile phone, or even an application software client; the embodiments of the present invention do not limit the specific form of the client to which the method is applied. The server may be a single server, a server cluster, or even a platform built on a server cluster.
The target string dictionary obtained in step 101 may come from a local database or may be sent by another client or server; the embodiments of the present invention do not limit the manner in which the target string dictionary is obtained.
In step 102, to generate the index information from the target string dictionary, single strings and combined strings may first be extracted from the target string dictionary, and the index information is then built from the extracted single strings and combined strings. The index information may be inverted index information or may take another specific index structure; the embodiments of the present invention do not limit the specific structure of the index information, as long as an index relationship can be constructed.
The index information constructed in the embodiments of the present invention includes the conditional attributes of the strings in the target string dictionary, that is, it covers target strings that carry conditions. The conditional attributes cover the AND, OR, and NOT conditions. An AND condition means that the string terms before and after the condition occur together, and can be written with the operator &; for example, A&B means that A and B occur together. A NOT condition means that the string term before the condition occurs and the string term after the condition does not, and can be written with the operator ~; for example, A~B means that A occurs and B does not. Further, A&B~C means that A and B occur in the query string but C does not; A&B~C~D means that A and B occur in the query string but neither C nor D does, which has the same effect as A&B~(C|D). The AND and NOT operators can therefore in practice express all three kinds of condition. By building index information that carries string conditional attributes, the embodiments of the present invention allow conditional keywords to be configured flexibly and the matching process to be completed quickly, so that strings with conditions can be pattern-matched in a single query.
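The operator notation can be illustrated with a minimal Python sketch. It assumes that the first term of an expression is implicitly an AND term and that terms themselves never contain & or ~; the names `Condition`, `parse_condition`, and `satisfied` are hypothetical and do not come from the specification.

```python
from typing import NamedTuple


class Condition(NamedTuple):
    required: frozenset  # terms joined by "&": all must occur
    excluded: frozenset  # terms introduced by "~": none may occur


def parse_condition(expr: str) -> Condition:
    """Split an expression such as "A&B~C~D" into required and excluded terms."""
    required, excluded = set(), set()
    term, negated = [], False  # assumption: the first term is an AND term
    for ch in expr + "&":      # the trailing sentinel flushes the last term
        if ch in "&~":
            word = "".join(term).strip()
            if word:
                (excluded if negated else required).add(word)
            term, negated = [], (ch == "~")
        else:
            term.append(ch)
    return Condition(frozenset(required), frozenset(excluded))


def satisfied(cond: Condition, present: set) -> bool:
    """True when every required term occurs and no excluded term occurs."""
    return cond.required <= present and not (cond.excluded & present)
```

Under these assumptions, `parse_condition("A&B~C~D")` yields the required terms A and B and the excluded terms C and D, which is the A&B~(C|D) reading given above.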
In step 103, the multi-pattern matching dictionary tree is constructed according to the index information. As noted above, the index information may be inverted index information; for example, when the original strings are A&B~C~D and A&M, the inverted index entry for A is A:[[A-&, B-&, C-~, D-~], [A-&, M-&]]. The index information may also take another specific structure, and the multi-pattern matching dictionary tree constructed from it may likewise take various specific forms, as long as a dictionary tree capable of multi-pattern matching of strings can be constructed from index information that carries string conditional attributes. It may be an AC automaton or another dictionary tree; the embodiments of the present invention do not limit the specific structure of the multi-pattern matching dictionary tree.
In one embodiment, as shown in Fig. 2, the method further comprises:
Step 104: obtaining a query string;
Step 105: matching the query string against the multi-pattern matching dictionary tree to obtain a plurality of matched strings;
Step 106: screening the index information according to the plurality of matched strings to obtain a matching target string, the conditional attribute of the matching target string being identical to the conditional attribute of the query string.
The query string obtained in step 104 may be any text, such as an article, a paragraph, or even a single sentence. After the query string is obtained, step 105 matches it against the multi-pattern matching dictionary tree, which completes the matching quickly and returns a plurality of matched strings at once. Finally, the index information is screened according to the matched strings to obtain the matching target strings whose conditional attributes are identical to those of the query string, thereby achieving string pattern matching that supports conditions.
The embodiments of the present invention build index information that carries string conditional attributes, match the query string against the multi-pattern matching dictionary tree constructed from that index information to obtain matched strings, and then screen the index information according to the matched strings to obtain the matching target strings whose conditional attributes are identical to those of the query string. This achieves conditional string pattern matching and effectively improves the matching efficiency for combined strings.
In one embodiment, generating the index information from the target string dictionary comprises:
sorting and de-duplicating the target strings in the target string dictionary according to their conditional attributes to obtain de-duplicated single target strings and combined target strings, the conditional attributes comprising three kinds: AND, OR, and NOT;
building an inverted index from the de-duplicated single target strings and combined target strings to obtain the index information, the index information including the conditional attribute of each single target string within its corresponding combined target string.
In the embodiments of the present invention, after the target string dictionary is obtained, its entries are first sorted by each kind of condition and de-duplicated, which effectively reduces resource usage in subsequent data processing and helps improve string query efficiency. Specifically, within the conditional attributes, A&B~C~D and B&A~D~C are identical once their AND conditions are sorted and identical once their NOT conditions are sorted, so only A&B~C~D is finally retained as the de-duplicated string. An inverted index is then built from the de-duplicated strings, which include single target strings and combined target strings, and serves as the set of candidate target strings.
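The sort-then-de-duplicate step can be sketched as follows. This is a minimal illustration, assuming that two expressions are duplicates when their sorted AND terms and sorted NOT terms coincide; the helper names `canonical` and `dedupe` are hypothetical.

```python
def canonical(expr: str) -> str:
    """Rewrite an expression with its AND terms and its NOT terms each sorted."""
    required, excluded = [], []
    term, negated = "", False  # assumption: the first term is an AND term
    for ch in expr + "&":      # the trailing sentinel flushes the last term
        if ch in "&~":
            if term:
                (excluded if negated else required).append(term)
            term, negated = "", ch == "~"
        else:
            term += ch
    head = "&".join(sorted(set(required)))
    return head + "".join("~" + t for t in sorted(set(excluded)))


def dedupe(dictionary):
    """Keep the first canonical form of each expression, preserving order."""
    seen, result = set(), []
    for expr in dictionary:
        key = canonical(expr)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result
```

With this sketch, both A&B~C~D and B&A~D~C reduce to the same canonical form A&B~C~D, so only one of them survives de-duplication.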
As shown in Fig. 3, the original strings in the target string dictionary are:
A&B~C~D, B&A~C~D, A&M, E
After sorting and de-duplication they are: A&B~C~D, A&M, E
The inverted index built from these strings is:
A:[[A-&, B-&, C-~, D-~], [A-&, M-&]]
B:[A-&, B-&, C-~, D-~]
C:[A-&, B-&, C-~, D-~]
D:[A-&, B-&, C-~, D-~]
E:[E-&]
In the index information so built, the conditions are stored in the attributes of the strings. The inverted index above, for example, contains the AND, OR, and NOT conditional attribute of each single target string within its corresponding combined target string, that is, it contains the information of every conditional combined target string in which each string occurs, which provides the necessary basis for conditional string queries. In the embodiments of the present invention, the inverted index over the de-duplicated single target strings and combined target strings may be built with a hash map or in another manner.
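A hash-map-based inverted index of this kind might look like the sketch below. It assumes each term is stored as a (term, operator) pair, mirroring the A-& and C-~ notation used above; `split_terms` and `build_inverted_index` are hypothetical names. The sketch also creates an entry for M, which the listing above does not show.

```python
from collections import defaultdict


def split_terms(expr):
    """Return [(term, op), ...] where op is "&" (must occur) or "~" (must not occur)."""
    terms, term, op = [], "", "&"  # assumption: the first term is an AND term
    for ch in expr + "&":          # the trailing sentinel flushes the last term
        if ch in "&~":
            if term:
                terms.append((term, op))
            term, op = "", ch
        else:
            term += ch
    return terms


def build_inverted_index(patterns):
    """Map every single string to the list of expressions it appears in."""
    index = defaultdict(list)
    for expr in patterns:
        entry = split_terms(expr)
        for term, _ in entry:
            index[term].append(entry)
    return dict(index)


index = build_inverted_index(["A&B~C~D", "A&M", "E"])
```

Here `index["A"]` holds two entries, one for A&B~C~D and one for A&M, while `index["E"]` holds the single entry for E.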
In one embodiment, the multi-pattern matching dictionary tree is a dictionary tree that combines an AC automaton with a double-array trie.
In the embodiments of the present invention, the AC automaton is a string search algorithm that can complete multi-pattern matching at high speed. However, its final performance depends on how well it is implemented: an AC automaton often suffers reduced overall performance because of its large space complexity and the cost of some of its functions. A double-array trie, by contrast, can complete single-string matching at high speed with controllable memory consumption, but its weakness is multi-pattern matching. Therefore, a dictionary tree that combines the two, that is, an AC automaton whose dictionary tree is represented as a double-array trie, is used as the multi-pattern matching dictionary tree in the embodiments of the present invention and gathers the advantages of both. Because the dictionary tree is represented with only two linear arrays, space efficiency improves considerably; this data structure performs well and is widely used in the field of word segmentation.
Specifically, when the dictionary tree combining the AC automaton and the double-array trie is constructed according to the index information, the double-array trie may first be built from the index information, and the AC automaton is then built on top of the double-array trie, yielding the combined dictionary tree, which is finally stored.
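The two-phase construction (trie first, then failure links) can be sketched with an ordinary dictionary-based trie. Note that this sketch deliberately swaps in per-state Python dicts for the double-array representation, so it illustrates only the AC automaton logic and not the space savings described above; `build_automaton` and `search` are hypothetical names.

```python
from collections import deque


def build_automaton(words):
    """Phase 1: insert the words into a trie. Phase 2: add failure links by BFS."""
    goto, fail, out = [{}], [0], [[]]  # three parallel lists indexed by state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    queue = deque(goto[0].values())    # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit matches ending here
    return goto, fail, out


def search(automaton, text):
    """Return (start_index, word) for every dictionary word found in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((pos - len(word) + 1, word))
    return hits
```

A single pass over the input text reports every dictionary word it contains, which is the property the combined structure relies on for multi-pattern matching.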
In one embodiment, constructing the multi-pattern matching dictionary tree according to the index information further comprises:
obtaining scene attributes corresponding to the single target strings;
constructing the dictionary tree that combines the AC automaton and the double-array trie according to the index information and the scene attributes of the single target strings.
The scene attributes in the embodiments of the present invention may be category information, part-of-speech information, or other attribute information that specifies how a string is used in an application. Category information is, for example, an attribute such as food or person; part-of-speech information is, for example, an attribute such as prototype word or alternative word. By constructing the dictionary tree according to both the index information and the scene attributes of the single target strings, it is possible not only to query strings by AND, OR, and NOT conditional attributes but also to query strings by other attribute conditions, thereby providing a fast multi-pattern matching scheme that supports multiple conditions. Moreover, storing the conditions in the attributes of the strings effectively saves space and helps improve query efficiency.
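One way to picture scene attributes is as a small record attached to each single target string and consulted after matching. The sketch below is an assumption about how such attributes could be stored and used; the `SceneAttributes` record, its field names, and `filter_by_category` are hypothetical and are not described in the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneAttributes:
    category: str        # e.g. "food" or "person"
    part_of_speech: str  # e.g. "prototype" or "alternative"


def filter_by_category(matched, attributes, category):
    """Keep only the matched strings whose scene category equals `category`."""
    return [
        word for word in matched
        if word in attributes and attributes[word].category == category
    ]
```

For instance, if A is tagged with the category "food" and E with "person", filtering the matched strings A and E by "food" keeps only A.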
In one embodiment, screening the index information according to the plurality of matched strings to obtain the matching target string comprises:
setting up a bitmap whose size equals the number of all single strings in the multi-pattern matching dictionary tree;
marking the positions on the bitmap that correspond to the plurality of matched strings as true;
obtaining, according to the attributes of the plurality of matched strings, all strings corresponding to each matched string;
determining as matching target strings those strings, among all the strings corresponding to each matched string, whose corresponding positions on the bitmap are all marked true.
In the embodiments of the present invention, after the query string is matched against the multi-pattern matching dictionary tree and a plurality of matched strings are obtained, the index information is screened according to those matched strings for strings whose conditional attributes satisfy the conditional attributes of the query string. Since the multi-pattern matching dictionary tree is itself constructed from the index information, the screening may of course also be performed on the multi-pattern matching dictionary tree according to the information of the matched strings. Specifically, the screening in the embodiments of the present invention is conditional screening: the matching target strings that correspond to the matched strings and whose conditional attributes are identical to those of the query string are selected from the index information.
Conditional screening proceeds as follows. First, a bitmap is set up whose size corresponds to the index information, that is, to the number of all single strings in the candidate target strings or, as noted above, to the number of all single strings in the multi-pattern matching dictionary tree. Each bit of the bitmap corresponds to one of those single strings, and the bitmap is initialised accordingly. Next, the positions on the bitmap that correspond to the matched strings obtained above are marked true, so that the conditional attributes of the matched strings are recorded in the bitmap. For example, when the query string is M…N…F, no matched string can be obtained, and the current indexing task is terminated directly. When the query string is A…E…F, the matched single strings are A and E, so the condition of the matched strings includes "not F"; the positions of A and E on the bitmap are therefore set to true while F remains false on the bitmap, which encodes the "not F" conditional attribute of the matched strings.
Then all strings corresponding to each matched string are retrieved according to the attributes of the matched strings, and each of those strings is checked to see whether all of its corresponding positions on the bitmap are true. The strings whose corresponding positions are all true are selected and determined to be the matching target strings. The selected matching target strings thus all satisfy the "not F" conditional attribute, which meets the requirement that the conditional attribute of a matching target string be identical to that of the query string. Finally, the details of each matching result are output; for example, one output is {"beginIndex": 0, "endIndex": 20, "word": "A&B~C~D", "label": "Politics"}. Once all matching results have been collected, the match is complete, the indexing task can end, and the next task can be started. The embodiments of the present invention store the conditions in the attributes of the strings and use a bitmap (BitSet) at query time to indicate whether the input text has hit a given string, which saves space and improves string query efficiency.
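The bitmap screening can be sketched as below. The text describes the check as "all corresponding positions are true"; this sketch makes the explicit assumption that an AND term (&) requires its bit to be set and a NOT term (~) requires its bit to be clear, which is one reading of that description rather than a confirmed detail. `SINGLES`, `INDEX`, `render`, and `screen` are hypothetical names, and the data reuses the example dictionary A&B~C~D, A&M, E.

```python
# Hypothetical data: single strings and inverted index for A&B~C~D, A&M, E.
# Each index entry is a list of (term, operator) pairs.
SINGLES = ["A", "B", "C", "D", "M", "E"]
ABCD = [("A", "&"), ("B", "&"), ("C", "~"), ("D", "~")]
AM = [("A", "&"), ("M", "&")]
INDEX = {"A": [ABCD, AM], "B": [ABCD], "C": [ABCD], "D": [ABCD],
         "M": [AM], "E": [[("E", "&")]]}


def render(entry):
    """Turn [(term, op), ...] back into text such as "A&B~C~D"."""
    text = ""
    for i, (term, op) in enumerate(entry):
        text += term if i == 0 else op + term
    return text


def screen(matched, singles=SINGLES, index=INDEX):
    """Return the candidate expressions whose conditions the bitmap satisfies."""
    position = {s: i for i, s in enumerate(singles)}
    bitmap = [False] * len(singles)      # one bit per single string
    for s in matched:
        bitmap[position[s]] = True       # mark every matched single string
    results = []
    for s in sorted(matched):
        for entry in index.get(s, []):
            # Assumption: "&" terms need a set bit, "~" terms need a clear bit.
            ok = all(bitmap[position[t]] == (op == "&") for t, op in entry)
            expr = render(entry)
            if ok and expr not in results:
                results.append(expr)
    return results
```

With the matched single strings A and E, only the expression E passes: A&B~C~D fails because B is absent, and A&M fails because M is absent. A production version would more likely pack the bits into an integer or a dedicated bit-set type than use a list of booleans.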
Referring to Fig. 4, in another aspect an embodiment of the present invention provides a string matching device, comprising:
a data acquisition module 201, configured to obtain a target string dictionary;
a data processing module 202, configured to generate index information from the target string dictionary, the index information including the conditional attributes of the strings in the target string dictionary, and further configured to construct a multi-pattern matching dictionary tree according to the index information.
The target string dictionary obtained by the data acquisition module 201 may come from a local database or may be sent by another client or server; the embodiments of the present invention do not limit the manner in which the target string dictionary is obtained.
To generate the index information from the target string dictionary, the data processing module 202 may first extract single strings and combined strings from the target string dictionary and then build the index information from the extracted single strings and combined strings. The index information may be inverted index information or may take another specific index structure; the embodiments of the present invention do not limit the specific structure of the index information, as long as an index relationship can be constructed.
The index information constructed in the embodiments of the present invention includes the conditional attributes of the strings in the target string dictionary, that is, it covers target strings that carry conditions. The conditional attributes cover the AND, OR, and NOT conditions. An AND condition means that the string terms before and after the condition occur together, and can be written with the operator &; for example, A&B means that A and B occur together. A NOT condition means that the string term before the condition occurs and the string term after the condition does not, and can be written with the operator ~; for example, A~B means that A occurs and B does not. Further, A&B~C means that A and B occur in the query string but C does not; A&B~C~D means that A and B occur in the query string but neither C nor D does, which has the same effect as A&B~(C|D). The AND and NOT operators can therefore in practice express all three kinds of condition. By building index information that carries string conditional attributes, the embodiments of the present invention allow conditional keywords to be configured flexibly and the matching process to be completed quickly, so that strings with conditions can be pattern-matched in a single query.
The data processing module 202 constructs the multi-pattern matching dictionary tree according to the index information. As noted above, the index information may be inverted index information; for example, when the original strings are A&B~C~D and A&M, the inverted index entry for A is A:[[A-&, B-&, C-~, D-~], [A-&, M-&]]. The index information may also take another specific structure, and the multi-pattern matching dictionary tree constructed from it may likewise take various specific forms, as long as a dictionary tree capable of multi-pattern matching of strings can be constructed from index information that carries string conditional attributes. It may be an AC automaton or another dictionary tree; the embodiments of the present invention do not limit the specific structure of the multi-pattern matching dictionary tree.
In one embodiment, the data acquisition module 201 is further configured to obtain a query string;
the data processing module 202 is further configured to match the query string against the multi-pattern matching dictionary tree to obtain a plurality of matched strings;
the data processing module 202 is further configured to screen the index information according to the plurality of matched strings to obtain a matching target string, the conditional attribute of the matching target string being identical to the conditional attribute of the query string.
The query string obtained by the data acquisition module 201 may be any text, such as an article, a paragraph, or even a single sentence. After the query string is obtained, the data processing module 202 matches it against the multi-pattern matching dictionary tree, which completes the matching quickly and returns a plurality of matched strings at once. Finally, the index information is screened according to the matched strings to obtain the matching target strings whose conditional attributes are identical to those of the query string, thereby achieving string pattern matching that supports conditions.
The embodiments of the present invention build index information that carries string conditional attributes, match the query string against the multi-pattern matching dictionary tree constructed from that index information to obtain matched strings, and then screen the index information according to the matched strings to obtain the matching target strings whose conditional attributes are identical to those of the query string. This achieves conditional string pattern matching and effectively improves the matching efficiency for combined strings.
In one embodiment, the data processing module 202 is further configured to: sort and de-duplicate the target strings in the target string dictionary according to their conditional attributes to obtain de-duplicated single target strings and combined target strings, the conditional attributes comprising three kinds: AND, OR, and NOT; and build an inverted index from the de-duplicated single target strings and combined target strings to obtain the index information, the index information including the conditional attribute of each single target string within its corresponding combined target string.
In the embodiments of the present invention, after the data acquisition module 201 obtains the target string dictionary, the data processing module 202 first sorts its entries by each kind of condition and de-duplicates them, which effectively reduces resource usage in subsequent data processing and helps improve string query efficiency. Specifically, within the conditional attributes, A&B~C~D and B&A~D~C are identical once their AND conditions are sorted and identical once their NOT conditions are sorted, so only A&B~C~D is finally retained as the de-duplicated string. An inverted index is then built from the de-duplicated strings, which include single target strings and combined target strings, and serves as the set of candidate target strings.
As shown in Fig. 3, the original strings in the target string dictionary are:
A&B~C~D, B&A~C~D, A&M, E
After sorting and de-duplication they are: A&B~C~D, A&M, E
The inverted index built from these strings is:
A:[[A-&, B-&, C-~, D-~], [A-&, M-&]]
B:[A-&, B-&, C-~, D-~]
C:[A-&, B-&, C-~, D-~]
D:[A-&, B-&, C-~, D-~]
E:[E-&]
In the index information so built, the conditions are stored in the attributes of the strings. The inverted index above, for example, contains the AND, OR, and NOT conditional attribute of each single target string within its corresponding combined target string, that is, it contains the information of every conditional combined target string in which each string occurs, which provides the necessary basis for conditional string queries. In the embodiments of the present invention, the inverted index over the de-duplicated single target strings and combined target strings may be built with a hash map or in another manner.
In one embodiment, the multi-pattern matching dictionary tree is a dictionary tree that combines an AC automaton with a double-array trie.
In the embodiments of the present invention, the AC automaton is a string search algorithm that can complete multi-pattern matching at high speed. However, its final performance depends on how well it is implemented: an AC automaton often suffers reduced overall performance because of its large space complexity and the cost of some of its functions. A double-array trie, by contrast, can complete single-string matching at high speed with controllable memory consumption, but its weakness is multi-pattern matching. Therefore, a dictionary tree that combines the two, that is, an AC automaton whose dictionary tree is represented as a double-array trie, is used as the multi-pattern matching dictionary tree in the embodiments of the present invention and gathers the advantages of both. Because the dictionary tree is represented with only two linear arrays, space efficiency improves considerably; this data structure performs well and is widely used in the field of word segmentation.
Specifically, when the dictionary tree combining the AC automaton and the double-array trie is constructed according to the index information, the double-array trie may first be built from the index information, and the AC automaton is then built on top of the double-array trie, yielding the combined dictionary tree, which is finally stored.
In one embodiment, the data acquisition module 201 is further configured to obtain scene attributes corresponding to the single target strings; and the data processing module 202 is further configured to construct the dictionary tree that combines the AC automaton and the double-array trie according to the index information and the scene attributes of the single target strings.
The scene attributes in the embodiments of the present invention may be category information, part-of-speech information, or other attribute information that specifies how a string is used in an application. Category information is, for example, an attribute such as food or person; part-of-speech information is, for example, an attribute such as prototype word or alternative word. By constructing the dictionary tree according to both the index information and the scene attributes of the single target strings, it is possible not only to query strings by AND, OR, and NOT conditional attributes but also to query strings by other attribute conditions, thereby providing a fast multi-pattern matching scheme that supports multiple conditions. Moreover, storing the conditions in the attributes of the strings effectively saves space and helps improve query efficiency.
In one embodiment, the data processing module 202 is further configured to set up a bitmap whose size equals the number of all single strings in the multi-pattern matching dictionary tree; the data processing module 202 is further configured to mark the positions on the bitmap that correspond to the plurality of matched strings as true; the data acquisition module 201 is further configured to obtain, according to the attributes of the plurality of matched strings, all strings corresponding to each matched string; and the data processing module 202 is further configured to determine as matching target strings those strings, among all the strings corresponding to each matched string, whose corresponding positions on the bitmap are all marked true.
In the embodiments of the present invention, after the query string is matched against the multi-pattern matching dictionary tree and a plurality of matched strings are obtained, the index information is screened according to those matched strings for strings whose conditional attributes satisfy the conditional attributes of the query string. Since the multi-pattern matching dictionary tree is itself constructed from the index information, the screening may of course also be performed on the multi-pattern matching dictionary tree according to the information of the matched strings. Specifically, the screening in the embodiments of the present invention is conditional screening: the matching target strings that correspond to the matched strings and whose conditional attributes are identical to those of the query string are selected from the index information.
Conditional screening proceeds as follows. First, a bitmap is set up whose size corresponds to the index information, that is, to the number of all single strings in the candidate target strings or, as noted above, to the number of all single strings in the multi-pattern matching dictionary tree. Each bit of the bitmap corresponds to one of those single strings, and the bitmap is initialised accordingly. Next, the positions on the bitmap that correspond to the matched strings obtained above are marked true, so that the conditional attributes of the matched strings are recorded in the bitmap. For example, when the query string is M…N…F, no matched string can be obtained, and the current indexing task is terminated directly. When the query string is A…E…F, the matched single strings are A and E, so the condition of the matched strings includes "not F"; the positions of A and E on the bitmap are therefore set to true while F remains false on the bitmap, which encodes the "not F" conditional attribute of the matched strings.
Then all strings corresponding to each matched string are retrieved according to the attributes of the matched strings, and each of those strings is checked to see whether all of its corresponding positions on the bitmap are true. The strings whose corresponding positions are all true are selected and determined to be the matching target strings. The selected matching target strings thus all satisfy the "not F" conditional attribute, which meets the requirement that the conditional attribute of a matching target string be identical to that of the query string. Finally, the details of each matching result are output; for example, one output is {"beginIndex": 0, "endIndex": 20, "word": "A&B~C~D", "label": "Politics"}. Once all matching results have been collected, the match is complete, the indexing task can end, and the next task can be started. The embodiments of the present invention store the conditions in the attributes of the strings and use a bitmap (BitSet) at query time to indicate whether the input text has hit a given string, which saves space and improves string query efficiency.
In a further aspect, the present invention provides a computer-readable storage medium storing computer-executable instructions which, when executed, perform the string matching method of any of the above embodiments.
It should be noted that the description of the above device and storage medium embodiments is similar to the description of the method embodiments above and has similar beneficial effects. For technical details not disclosed in these embodiments, refer to the description of the method embodiments of the present invention; they are not repeated here for brevity.
In the embodiments of the present invention, the order in which the steps are carried out may be changed, provided that this does not prevent the purpose from being achieved.
The above are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.