CN110516118A - Character string matching method, device and computer storage medium - Google Patents

Character string matching method, device and computer storage medium

Info

Publication number
CN110516118A
Authority
CN
China
Prior art keywords
string
character string
index information
dictionary
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910743568.5A
Other languages
Chinese (zh)
Inventor
李喜莲
林士翔
雷欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Volkswagen China Investment Co Ltd
Mobvoi Innovation Technology Co Ltd
Original Assignee
Go Out And Ask (wuhan) Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Go Out And Ask (wuhan) Information Technology Co Ltd
Priority to CN201910743568.5A
Publication of CN110516118A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9027 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9032 Query formulation
    • G06F16/90332 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention discloses a character string matching method, a device and a computer storage medium. The method comprises: obtaining a target string dictionary; generating index information from the target string dictionary, the index information including the condition attributes of the strings in the target string dictionary; and constructing a multi-pattern matching trie from the index information. The embodiment implements conditional string pattern matching and effectively improves the matching efficiency for combined strings.

Description

Character string matching method, device and computer storage medium
Technical field
The present invention relates to the technical field of data processing, and in particular to a character string matching method, a device and a computer storage medium.
Background
With the continuous development of information technology, machines play an increasingly important role in daily life. In the design of machine functions, a dialogue system for communicating with people is therefore particularly important.
A dialogue system usually includes various keyword query modules for multi-pattern string matching, for example finding which phrases of a dictionary occur in a passage of text; this can be used for tasks such as sensitive-word lookup in anti-spam filtering. One existing matching method is based on a hash map, but its space and time complexity grow with the dictionary size. Another implementation is the trie (dictionary tree), which improves considerably on both space and time efficiency. For multiple target strings, however, these existing schemes all require multiple queries, and none of them supports conditional target string pattern matching. For example, a query for the case where A and B occur but neither C nor D occurs cannot currently be completed in a single query.
Summary of the invention
To effectively overcome the above drawbacks of the prior art, an embodiment of the present invention provides a character string matching method, the method comprising: obtaining a target string dictionary; generating index information from the target string dictionary, the index information including the condition attributes of the strings in the target string dictionary; and constructing a multi-pattern matching trie from the index information.
In an embodiment, generating index information from the target string dictionary comprises: sorting and deduplicating according to the condition attributes between the target strings in the target string dictionary to obtain deduplicated single target strings and combined target strings, the condition attributes including three kinds: AND, OR and NOT; and building an inverted index from the deduplicated single target strings and combined target strings to obtain the index information, the index information including the condition attribute of each single target string within its corresponding combined target string.
In an embodiment, the multi-pattern matching trie is a trie that combines an AC automaton (Aho-Corasick automaton) with a double-array trie.
In an embodiment, constructing the multi-pattern matching trie from the index information comprises: obtaining scene attributes corresponding to the single target strings; and constructing, from the index information and the scene attributes of the single target strings, the trie that combines the AC automaton with the double-array trie.
In an embodiment, the method further comprises: obtaining a query string; matching the query string against the multi-pattern matching trie to obtain a plurality of matched strings; and screening the index information according to the plurality of matched strings to obtain matching target strings, the condition attributes of the matching target strings being identical to those of the query string.
In an embodiment, screening the index information according to the plurality of matched strings to obtain matching target strings comprises: setting a bitmap whose size equals the number of all single strings in the index information; marking the positions on the bitmap corresponding to the plurality of matched strings as true; obtaining, according to the attributes of the plurality of matched strings, all strings corresponding to each matched string; and determining, as matching target strings, those strings among all the strings corresponding to each matched string whose corresponding positions on the bitmap are all marked true.
In another aspect, an embodiment of the present invention provides a character string matching device, comprising: a data acquisition module for obtaining a target string dictionary; and a data processing module for generating index information from the target string dictionary, the index information including the condition attributes of the strings in the target string dictionary; the data processing module being further configured to construct a multi-pattern matching trie from the index information.
In an embodiment, the data processing module is further configured to sort and deduplicate according to the condition attributes between the target strings in the target string dictionary to obtain deduplicated single target strings and combined target strings, the condition attributes including three kinds: AND, OR and NOT; the data processing module is further configured to build an inverted index from the deduplicated single target strings and combined target strings to obtain the index information, the index information including the condition attribute of each single target string within its corresponding combined target string.
In an embodiment, the multi-pattern matching trie is a trie that combines an AC automaton with a double-array trie.
In another aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions which, when executed, perform the character string matching method of any of the above.
The character string matching method, device and computer storage medium provided by the embodiments of the present invention build index information carrying string condition attributes, match a query string against the multi-pattern matching trie constructed from that index information to obtain matched strings, and finally screen the index information based on the matched strings to obtain the matching combined strings whose condition attributes are identical to those of the query string. This implements conditional string pattern matching and effectively improves the matching efficiency for combined strings.
Brief description of the drawings
The above and other objects, features and advantages of exemplary embodiments of the invention will become readily understood by reading the following detailed description with reference to the accompanying drawings. Several embodiments of the invention are shown in the drawings by way of example and not limitation. In the drawings, identical or corresponding reference numerals denote identical or corresponding parts:
Fig. 1 is a schematic flowchart of one implementation of a character string matching method provided by an embodiment of the invention;
Fig. 2 is a schematic flowchart of another implementation of a character string matching method provided by an embodiment of the invention;
Fig. 3 is a schematic flowchart of a specific implementation of a character string matching method provided by an embodiment of the invention;
Fig. 4 is a structural diagram of a character string matching device provided by an embodiment of the invention.
Detailed description of the embodiments
To make the objects, features and advantages of the present invention more apparent and understandable, the technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by those skilled in the art from these embodiments without creative effort fall within the protection scope of the invention.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that a particular feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. Moreover, the particular features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or as implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In the description of the invention, "a plurality" means two or more, unless specifically defined otherwise.
The implementations described in the following exemplary embodiments do not represent all implementations consistent with this specification. Rather, they are merely examples of methods, apparatuses or devices consistent with some aspects of this specification as detailed in the appended claims.
As noted above, the related art can improve the efficiency of string matching by using a preset matching algorithm, such as the Aho-Corasick automaton algorithm (AC automaton algorithm), the Boyer-Moore algorithm, or other algorithms.
Referring to Fig. 1, in one aspect an embodiment of the present invention provides a character string matching method, the method comprising:
Step 101: obtain a target string dictionary;
Step 102: generate index information from the target string dictionary, the index information including the condition attributes of the strings in the target string dictionary;
Step 103: construct a multi-pattern matching trie from the index information.
The character string matching method provided in the embodiment can be applied to a client or a server. The client may be, for example, a desktop computer, a mobile phone, or even an application software client; the embodiment does not limit the specific form of the client to which the method is applied. The server in the embodiment may be a single server, a server cluster, or even a platform built on a server cluster.
The target string dictionary obtained in step 101 may come from a local database or may be sent by another client or server; the embodiment does not limit how the target string dictionary is obtained.
In step 102, to generate index information from the target string dictionary, single strings and combined strings may first be extracted from the target string dictionary, and the index information is then built from the extracted single strings and combined strings. The index information may be inverted index information, or it may have some other specific index structure; the embodiment does not limit the specific structure of the index information, as long as an index relationship can be built.
The index information built in the embodiment includes the condition attributes of the strings in the target string dictionary, that is, it includes conditional target strings. The condition attributes cover the AND, OR and NOT conditions. The AND condition means that the strings before and after the condition occur together and can be written with the operator &; for example, A&B means that A and B occur together. The NOT condition means that the string before the condition occurs and the string after the condition does not, and can be written with the operator ~; for example, A~B means that A occurs and B does not. Further, A&B~C means that A and B occur in the query string but C does not. A&B~C~D means that A and B occur in the query string but neither C nor D occurs, which has the same effect as A&B~(C|D); AND and NOT together can therefore express all three kinds of condition. By building index information that carries string condition attributes, the embodiment allows conditional keywords to be configured flexibly and the matching process to be completed quickly, so that conditional strings can be pattern-matched in a single query.
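The operator notation described above can be illustrated with a minimal Python sketch. It assumes that single strings never contain the characters & or ~ and that an expression is a flat sequence of AND and NOT terms; the names `Condition`, `parse_condition` and `satisfied` are hypothetical and do not come from the patent.

```python
from typing import NamedTuple


class Condition(NamedTuple):
    required: frozenset   # terms joined by '&' (must all occur)
    forbidden: frozenset  # terms introduced by '~' (must not occur)


def parse_condition(expr: str) -> Condition:
    """Split an expression such as 'A&B~C~D' into required and forbidden terms."""
    required, forbidden = set(), set()
    op, term = "&", ""          # the first term is implicitly an AND term
    for ch in expr + "&":       # trailing sentinel flushes the last term
        if ch in "&~":
            if term:
                (required if op == "&" else forbidden).add(term)
            op, term = ch, ""
        else:
            term += ch
    return Condition(frozenset(required), frozenset(forbidden))


def satisfied(cond: Condition, present: set) -> bool:
    """True when every required term is present and no forbidden term is."""
    return cond.required <= present and not (cond.forbidden & present)


cond = parse_condition("A&B~C~D")
print(satisfied(cond, {"A", "B"}))        # A and B occur, C and D do not
print(satisfied(cond, {"A", "B", "C"}))   # C occurs, so the condition fails
```

Because `forbidden` is checked as a set, listing ~C~D behaves like ~(C|D), which is the equivalence the paragraph above relies on.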
In step 103, the multi-pattern matching trie is constructed from the index information. As noted above, the index information may be inverted index information; for example, when the original strings are A&B~C~D and A&M, the corresponding inverted index entry for A is A:[[A-&, B-&, C-~, D-~], [A-&, M-&]]. The index information may also have some other specific index structure, so the multi-pattern matching trie constructed from it may likewise take various specific forms. It is sufficient that a trie capable of multi-pattern matching on strings can be constructed from index information carrying string condition attributes; it may be an AC automaton or another kind of trie, and the embodiment does not limit the specific structure of the multi-pattern matching trie.
In an embodiment, as shown in Fig. 2, the method further comprises:
Step 104: obtain a query string;
Step 105: match the query string against the multi-pattern matching trie to obtain a plurality of matched strings;
Step 106: screen the index information according to the plurality of matched strings to obtain matching target strings, the condition attributes of the matching target strings being identical to those of the query string.
The query string obtained in step 104 may be any text, such as an article, a paragraph, or even a single sentence. After the query string is obtained, step 105 matches it against the multi-pattern matching trie, which completes the matching process quickly and returns a plurality of matched strings at once. Finally, the index information is screened according to the matched strings to obtain the matching target strings whose condition attributes are identical to those of the query string, thereby supporting conditional string pattern matching.
The embodiment builds index information carrying string condition attributes, matches the query string against the multi-pattern matching trie constructed from that index information to obtain matched strings, and finally screens the index information based on the matched strings to obtain the matching target strings whose condition attributes are identical to those of the query string. This implements conditional string pattern matching and effectively improves the matching efficiency for combined strings.
In an embodiment, generating index information from the target string dictionary comprises:
sorting and deduplicating according to the condition attributes between the target strings in the target string dictionary to obtain deduplicated single target strings and combined target strings, the condition attributes including three kinds: AND, OR and NOT;
building an inverted index from the deduplicated single target strings and combined target strings to obtain the index information,
the index information including the condition attribute of each single target string within its corresponding combined target string.
In the embodiment, after the target string dictionary is obtained, the dictionary entries are first sorted by condition and deduplicated, which effectively reduces resource usage in subsequent data processing and helps improve string query efficiency. Specifically, for entries such as A&B~C~D and B&A~D~C, the AND terms are identical after sorting and the NOT terms are identical after sorting, so only A&B~C~D is kept as the deduplicated string. An inverted index is then built from the deduplicated strings, which include single target strings and combined target strings, to serve as the candidate target strings.
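The sort-and-deduplicate step can be sketched as follows. This is a minimal illustration that assumes a canonical form in which the AND terms are sorted first and the NOT terms are sorted after them; the function names `normalize` and `sort_and_dedup` are hypothetical, and the patent does not prescribe a particular canonical ordering.

```python
def normalize(expr: str) -> str:
    """Return a canonical form: AND terms sorted, then NOT terms sorted."""
    required, forbidden = [], []
    op, term = "&", ""
    for ch in expr + "&":            # trailing sentinel flushes the last term
        if ch in "&~":
            if term:
                (required if op == "&" else forbidden).append(term)
            op, term = ch, ""
        else:
            term += ch
    canonical = "&".join(sorted(set(required)))
    for t in sorted(set(forbidden)):
        canonical += "~" + t
    return canonical


def sort_and_dedup(dictionary):
    """Canonicalize every entry and drop duplicates, keeping first-seen order."""
    seen, result = set(), []
    for expr in dictionary:
        key = normalize(expr)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


print(sort_and_dedup(["A&B~C~D", "B&A~D~C", "A&M", "E"]))
# prints ['A&B~C~D', 'A&M', 'E']
```

Both A&B~C~D and B&A~D~C reduce to the same canonical key, so only one of them survives, matching the example in the text.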
As shown in Fig. 3, the original strings in the target string dictionary are:
A&B~C~D, B&A~C~D, A&M, E
After sorting and deduplication: A&B~C~D, A&M, E
The inverted index built from these strings is:
A:[[A-&, B-&, C-~, D-~], [A-&, M-&]]
B:[A-&, B-&, C-~, D-~]
C:[A-&, B-&, C-~, D-~]
D:[A-&, B-&, C-~, D-~]
E:[E-&]
In the index information thus built, the conditions are stored in the attributes of the strings. The inverted index above, for example, contains the AND, OR and NOT condition attribute of each single target string within its corresponding combined target strings, that is, it records every conditional combined target string in which each string occurs, which provides the necessary basis for conditional string queries. In the embodiment, the inverted index information for the deduplicated single target strings and combined target strings may be built on a hash map or in other ways.
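A hash-map-based inverted index of the kind listed above could be sketched like this. It is an illustration only: the pair representation `(term, operator)` and the names `split_terms` and `build_inverted_index` are assumptions, not structures taken from the patent.

```python
from collections import defaultdict


def split_terms(expr):
    """Yield (term, operator) pairs, e.g. 'A&B~C' -> ('A','&'), ('B','&'), ('C','~')."""
    op, term = "&", ""
    for ch in expr + "&":            # trailing sentinel flushes the last term
        if ch in "&~":
            if term:
                yield term, op
            op, term = ch, ""
        else:
            term += ch


def build_inverted_index(expressions):
    """Map each single term to every combined entry it takes part in."""
    index = defaultdict(list)
    for expr in expressions:
        entry = list(split_terms(expr))
        for term, _ in entry:
            index[term].append(entry)
    return dict(index)


index = build_inverted_index(["A&B~C~D", "A&M", "E"])
for term, entries in index.items():
    print(term, entries)
```

Here A maps to two entries (one for A&B~C~D and one for A&M), as in the listing above. Note that this sketch also creates an entry for M, because it indexes every term of every combined string.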
In an embodiment, the multi-pattern matching trie is a trie that combines an AC automaton with a double-array trie.
In the embodiment, the AC automaton is a string search algorithm characterized by fast multi-pattern matching. How cleverly it is implemented, however, determines the final performance: an AC automaton often suffers reduced overall performance because of its large space complexity and the cost of some of its functions. A double-array trie can complete single-string matching at high speed with controllable memory consumption, but its weakness is multi-pattern matching. The embodiment therefore uses a trie that combines the two, that is, an AC automaton whose trie is represented as a double-array trie, as the multi-pattern matching trie, which brings together the advantages of both. The trie is represented with only two linear arrays, which greatly improves space efficiency; the structure is widely used in the field of word segmentation and performs well.
Specifically, when the trie combining the AC automaton and the double-array trie is constructed from the index information, the double-array trie may first be built from the index information and the AC automaton then built on top of the double-array trie, yielding the combined trie, which is finally stored.
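For orientation, the multi-pattern matching part can be sketched as a plain Aho-Corasick automaton. This sketch deliberately stores the trie in per-state Python dictionaries rather than in the double-array (base/check) layout described above, so it shows only the matching behaviour and not the space savings; the class name `AhoCorasick` and its attributes are hypothetical.

```python
from collections import deque


class AhoCorasick:
    def __init__(self, patterns):
        self.goto = [{}]      # goto[state][char] -> next state
        self.fail = [0]       # failure link per state
        self.out = [[]]       # patterns recognized on reaching each state
        for p in patterns:
            self._insert(p)
        self._build_failure_links()

    def _insert(self, pattern):
        state = 0
        for ch in pattern:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append(pattern)

    def _build_failure_links(self):
        queue = deque(self.goto[0].values())   # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # inherit the outputs of the longest proper suffix state
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]
                queue.append(nxt)

    def search(self, text):
        """Return (begin, end, pattern) for every occurrence, end exclusive."""
        state, hits = 0, []
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for p in self.out[state]:
                hits.append((i - len(p) + 1, i + 1, p))
        return hits


ac = AhoCorasick(["he", "she", "his", "hers"])
print(ac.search("ushers"))
# prints [(1, 4, 'she'), (2, 4, 'he'), (2, 6, 'hers')]
```

A single pass over the text reports every single string that occurs, which is the set of matched strings that the later screening step consumes.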
In an embodiment, constructing the multi-pattern matching trie from the index information further comprises:
obtaining scene attributes corresponding to the single target strings;
constructing, from the index information and the scene attributes of the single target strings, the trie that combines the AC automaton with the double-array trie.
The scene attributes in the embodiment may be category information, part-of-speech information, or other attribute information that characterizes how a string is used. Category information includes attributes such as food or person; part-of-speech information includes attributes such as prototype word or alternative word. Constructing the trie from both the index information and the scene attributes of the single target strings supports not only string queries with AND, OR and NOT condition attributes but also string queries with other attribute conditions, giving a fast multi-pattern matching scheme that supports multiple conditions. Moreover, storing the conditions in the attributes of the strings effectively saves space and helps improve query efficiency.
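One way to picture scene attributes is as a small record stored alongside each single target string. The sketch below is purely illustrative: the record type `TermInfo`, the table `TERM_INFO`, the sample entries and the helper `filter_by_category` are all hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermInfo:
    category: str    # e.g. "food" or "person"
    word_type: str   # e.g. "prototype" or "alternative"


# Hypothetical attribute table keyed by single target string.
TERM_INFO = {
    "noodles": TermInfo("food", "prototype"),
    "ramen": TermInfo("food", "alternative"),
    "Li Bai": TermInfo("person", "prototype"),
}


def filter_by_category(matched_terms, category):
    """Keep only matched terms whose stored category equals the requested one."""
    return [t for t in matched_terms
            if t in TERM_INFO and TERM_INFO[t].category == category]


print(filter_by_category(["ramen", "Li Bai"], "food"))
# prints ['ramen']
```

In a real system such a record would presumably be attached to the terminal node of each string in the trie, so that attribute conditions can be checked in the same pass as the AND and NOT conditions.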
In an embodiment, screening the index information according to the plurality of matched strings to obtain matching target strings comprises:
setting a bitmap whose size equals the number of all single strings in the multi-pattern matching trie;
marking the positions on the bitmap corresponding to the plurality of matched strings as true;
obtaining, according to the attributes of the plurality of matched strings, all strings corresponding to each matched string;
determining, as matching target strings, those strings among all the strings corresponding to each matched string whose corresponding positions on the bitmap are all marked true.
In the embodiment, after the query string has been matched against the multi-pattern matching trie and a plurality of matched strings obtained, the index information is screened according to those matched strings for the strings whose condition attributes satisfy the condition attributes of the query string. Since the multi-pattern matching trie is itself built from the index information, the screening may equally be performed on the multi-pattern matching trie according to the matched string information. The screening in the embodiment is a condition screening: it selects from the index information the matching target strings that correspond to the matched strings and whose condition attributes are identical to those of the query string.
Specifically, the condition screening first sets a bitmap whose size corresponds to the index information, that is, to the number of all single strings among the candidate target strings, or, as noted above, to the number of all single strings in the multi-pattern matching trie. Each bit of the bitmap corresponds to one of these single strings, and the bitmap is initialized accordingly. The positions on the bitmap corresponding to the matched strings obtained above are then marked as true, so that the condition attributes of the matched strings are recorded in the bitmap. For example, when the query string is M ... N ... F, no matched string is obtained and the index task ends directly. When the query string is A ... E ... F, the matched single strings A and E are obtained, so the condition of the matched strings includes NOT F; the positions of A and E on the bitmap are set to true while F remains false on the bitmap, which records the condition attribute "NOT F" of the matched strings. Next, all strings corresponding to each matched string are obtained according to the attributes of the matched strings. Finally, each of these strings is checked to see whether its corresponding positions on the bitmap are all true; the strings whose positions are all true are selected and determined to be the matching target strings. The selected matching target strings thus all satisfy the condition attribute "NOT F", which meets the requirement that the condition attributes of the matching target strings be identical to those of the query string. The details of each matching result are then output; for example, one output is {"beginIndex": 0, "endIndex": 20, "word": "A&B~C~D", "label": "Politics"}. Once all matching results have been collected, the match is complete and the index task can end, ready for the next task. By storing the conditions in the attributes of the strings and using a bitmap (BitSet) at query time to indicate whether an input text has hit a given string, the embodiment saves space and improves the query efficiency of strings.
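The bitmap screening can be sketched as follows. The sketch makes one interpretive assumption that the text does not spell out: an AND term passes when its bit is true, and a NOT term passes when its bit is false. The constants `INDEX` and `TERM_IDS` and the function `filter_with_bitmap` are hypothetical, and a list of booleans stands in for a packed BitSet.

```python
# Hypothetical inverted index: single term -> list of combined entries,
# each entry being a list of (term, operator) pairs.
ENTRY_1 = [("A", "&"), ("B", "&"), ("C", "~"), ("D", "~")]
ENTRY_2 = [("A", "&"), ("M", "&")]
ENTRY_3 = [("E", "&")]
INDEX = {"A": [ENTRY_1, ENTRY_2], "B": [ENTRY_1], "C": [ENTRY_1],
         "D": [ENTRY_1], "M": [ENTRY_2], "E": [ENTRY_3]}
TERM_IDS = {t: i for i, t in enumerate(sorted(INDEX))}   # bit position per term


def filter_with_bitmap(index, term_ids, matched_terms):
    """Return combined entries whose AND terms are all hit and NOT terms all missed."""
    bitmap = [False] * len(term_ids)          # one slot per single term
    for term in matched_terms:
        bitmap[term_ids[term]] = True

    results, seen = [], set()
    for term in matched_terms:
        for entry in index.get(term, []):
            key = tuple(entry)
            if key in seen:                   # each combined entry is checked once
                continue
            seen.add(key)
            if all(bitmap[term_ids[t]] == (op == "&") for t, op in entry):
                results.append(entry)
    return results


print(filter_with_bitmap(INDEX, TERM_IDS, ["A", "B"]))   # A&B~C~D qualifies
print(filter_with_bitmap(INDEX, TERM_IDS, ["A", "E"]))   # only E qualifies
```

Only entries reachable from a matched term are examined, so the cost of screening depends on the number of hits rather than on the size of the whole dictionary.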
Referring to FIG. 4, on the other hand the embodiment of the present invention provides a kind of string matching equipment, comprising:
Data acquisition module 201, for obtaining target string dictionary;
Data processing module 202, for according to target string dictionary creation index information, index information to include target word Character string conditional attribute in symbol string dictionary;It is also used to construct multi-mode matching dictionary tree according to index information;
Wherein, the target string dictionary that data acquisition module of the embodiment of the present invention 201 obtains may come from local number It sends to obtain according to library or by other clients or server, the not acquisition to target string dictionary herein of the embodiment of the present invention Mode is limited.
Data processing module of the embodiment of the present invention 202, can be first to mesh according to target string dictionary creation index information It marks String Dictionary and carries out unit string and the extraction of combining characters string, be then based on extracted unit string and combining characters string Construct index information.Wherein, index information may include inverted index information, and certainly, index information can also be specific for other Index information structure, the embodiment of the present invention does not limit the specific structure mode of index information herein, as long as can construct Index relative.
The index information constructed in the embodiment of the present invention includes the string conditional attributes in the target string dictionary, that is, it includes conditional target strings. The conditional attributes include AND, OR and NOT conditions. An AND condition means that the string descriptors before and after the condition occur together and can be expressed with the operator &; for example, A&B means that A and B occur together. A NOT condition means that the string descriptor before the condition occurs while the string descriptor after the condition does not, and can be expressed with the operator ~; for example, A~B means that A occurs and B does not. Further, A&B~C means that the query strings A and B occur but C does not. A&B~C~D means that A and B occur but neither C nor D occurs, which has the same effect as A&B~(C|D), so all three conditions, AND, OR and NOT, can in practice be expressed. By constructing index information with string conditional attributes, the embodiment of the present invention allows conditional keywords to be configured flexibly and the matching process to be completed quickly, so that pattern matching of conditional strings can be achieved in a single query.
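The operator notation above can be illustrated with a small Python sketch. The function names parse_condition and holds are hypothetical, and the sketch assumes that terms never contain the characters & or ~ and that the leading term is implicitly an AND term. Parenthesised OR groups such as (C|D) are not parsed here; per the text, A&B~(C|D) would be written as A&B~C~D.

```python
def parse_condition(expr):
    """Split an expression such as 'A&B~C~D' into (term, operator) pairs."""
    terms = []
    op = "&"        # assumption: the first term counts as an AND term
    current = ""
    for ch in expr:
        if ch in "&~":
            if current:
                terms.append((current, op))
            op = ch         # the operator applies to the term that follows it
            current = ""
        else:
            current += ch
    if current:
        terms.append((current, op))
    return terms


def holds(terms, present):
    """True if every AND term is in `present` and no NOT term is."""
    return all((term in present) == (op == "&") for term, op in terms)
```

For example, parse_condition("A&B~C") gives the pairs (A, &), (B, &), (C, ~), and holds returns True for the set {A, B} but False once C is also present.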
The data processing module 202 constructs the multi-pattern matching dictionary tree according to the index information. As noted above, the index information may include inverted index information; for example, when the original strings are A&B~C~D and A&M, the corresponding inverted index entry is A: [[A-&, B-&, C-~, D-~], [A-&, M-&]]. The index information may also take another specific structure. Accordingly, the multi-pattern matching dictionary tree may be constructed in various specific ways, as long as a dictionary tree capable of multi-pattern matching of strings can be built from the index information with string conditional attributes; it may be an AC automaton (Aho-Corasick automaton) or another dictionary tree. The embodiment of the present invention does not limit the specific structure of the multi-pattern matching dictionary tree.
In an embodiment, the data acquisition module 201 is further configured to obtain a query string;
the data processing module 202 is further configured to match the query string against the multi-pattern matching dictionary tree to obtain multiple matched strings;
the data processing module 202 is further configured to screen the index information according to the multiple matched strings to obtain matching target strings, the conditional attribute of each matching target string being identical to the conditional attribute of the query string.
The query string obtained by the data acquisition module 201 may be any text, such as an article, a paragraph or even a single sentence. After the query string is obtained, the data processing module 202 matches it against the multi-pattern matching dictionary tree, which completes the matching process quickly and returns multiple matched strings at once. Finally, the index information is screened according to the matched strings to obtain the matching target strings whose conditional attributes are identical to those of the query string, thereby realizing string pattern matching that supports conditions.
The embodiment of the present invention constructs index information with string conditional attributes, matches the query string against the multi-pattern matching dictionary tree built from this index information to obtain matched strings, and finally screens the index information based on the matched strings to obtain matching target strings whose conditional attributes are identical to those of the query string. This realizes conditional string pattern matching and effectively improves the matching efficiency of combined strings.
In an embodiment, the data processing module 202 is further configured to:
sort and de-duplicate the target strings in the target string dictionary according to their conditional attributes, to obtain de-duplicated single target strings and composite target strings, the conditional attributes including AND, OR and NOT; and establish an inverted index according to the de-duplicated single target strings and composite target strings to obtain the index information, the index information including the conditional attribute of each single target string in its corresponding composite target strings.
In the embodiment of the present invention, after the data acquisition module 201 obtains the target string dictionary, the data processing module 202 first sorts and de-duplicates the different conditions in the dictionary, which effectively reduces resource usage in subsequent data processing and helps improve string query efficiency. Specifically, for A&B~C~D and B&A~D~C, the AND conditions are identical after sorting and the NOT conditions are identical after sorting, so only A&B~C~D is finally kept as the de-duplicated string. The de-duplicated strings include single target strings and composite target strings, on which an inverted index is established to serve as the candidate target strings.
As shown in FIG. 3, the original strings in the target string dictionary are:
A&B~C~D, B&A~C~D, A&M, E
After sorting and de-duplication: A&B~C~D, A&M, E
The inverted index established from these strings is:
A:[[A-&, B-&, C-~, D-~], [A-&, M-&]]
B:[A-&, B-&, C-~, D-~]
C:[A-&, B-&, C-~, D-~]
D:[A-&, B-&, C-~, D-~]
E:[E-&]
In the index information thus established, the conditions are stored in the attributes of the strings. As in the inverted index above, it contains the AND, OR and NOT conditional attributes of each single target string in its corresponding composite target strings, that is, the information of the conditional composite target strings in which each string occurs, which provides the necessary basis for conditional string queries. In the embodiment of the present invention, the inverted index information may be established for the de-duplicated single target strings and composite target strings based on a hash map or in other ways.
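The sort, de-duplicate and index steps can be sketched with a hash map in Python. This is a minimal sketch under assumptions rather than the embodiment's own code: canonical and build_index are hypothetical names, terms are assumed not to contain & or ~, and sorting is assumed to be applied separately to the AND terms and the NOT terms. Note that the sketch also produces an entry for M, which the listing above does not show.

```python
from collections import defaultdict


def canonical(expr):
    """Normalise 'B&A~D~C' and 'A&B~C~D' to the same tuple of (term, operator) pairs."""
    and_terms, not_terms = [], []
    op, current = "&", ""
    for ch in expr + "&":               # the trailing '&' flushes the last term
        if ch in "&~":
            if current:
                (and_terms if op == "&" else not_terms).append(current)
            op, current = ch, ""
        else:
            current += ch
    return (tuple((t, "&") for t in sorted(and_terms))
            + tuple((t, "~") for t in sorted(not_terms)))


def build_index(dictionary):
    """Return (de-duplicated targets, inverted index from single string to targets)."""
    targets = []
    for expr in dictionary:
        key = canonical(expr)
        if key not in targets:          # drop duplicates, keep first-seen order
            targets.append(key)

    index = defaultdict(list)           # hash map, as the text suggests
    for target in targets:
        for term, _ in target:
            index[term].append(target)
    return targets, dict(index)


targets, index = build_index(["A&B~C~D", "B&A~C~D", "A&M", "E"])
```

After the call, targets holds three entries (A&B~C~D, A&M and E), and index["A"] lists both A&B~C~D and A&M, matching the first row of the inverted index above.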
In an embodiment, the multi-pattern matching dictionary tree is a dictionary tree combining an AC automaton and a double-array dictionary tree (double-array trie).
In the embodiment of the present invention, the AC automaton is a string search algorithm characterized by high-speed multi-pattern matching. However, its final performance depends on how well it is implemented; an AC automaton often suffers reduced overall performance because of its large space complexity and the performance cost of some functions. A double-array dictionary tree can complete single-string matching at high speed with controllable memory consumption, but multi-pattern matching is its weakness. Therefore, a dictionary tree combining the AC automaton and the double-array dictionary tree is used, that is, the dictionary tree of the AC automaton is represented by a double-array dictionary tree and serves as the multi-pattern matching dictionary tree in the embodiment of the present invention. This combines the advantages of both: the dictionary tree is represented using only two linear arrays, which greatly improves space efficiency; the structure is widely used in the word segmentation field and offers high performance.
Specifically, when the dictionary tree combining the AC automaton and the double-array dictionary tree is constructed according to the index information, the double-array dictionary tree may first be constructed according to the index information, and the AC automaton is then constructed on the basis of the double-array dictionary tree, yielding the combined dictionary tree, which is finally stored.
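The two-stage construction (trie first, failure links second) can be illustrated with a compact Aho-Corasick sketch in Python. As a loud simplification, the trie here is stored in nested dictionaries rather than in the two linear arrays of a double-array trie, so it shows only the matching logic and not the space savings; build_automaton and find_all are hypothetical names.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and output lists for the patterns."""
    goto = [{}]     # goto[state][char] -> next state (dict stand-in for double arrays)
    fail = [0]      # failure link of each state
    out = [[]]      # patterns recognised on reaching each state

    # Stage 1: insert every pattern into the trie.
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)

    # Stage 2: breadth-first pass that fills in the failure links.
    queue = deque(goto[0].values())     # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]   # inherit shorter suffix matches
    return goto, fail, out


def find_all(text, automaton):
    """Return (start, end, pattern) for every pattern occurrence in text."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, i + 1, pattern))
    return hits
```

In the scheme described above, the patterns would be the single strings taken from the index information, and the hits would feed the subsequent screening step.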
In an embodiment, the data acquisition module 201 is further configured to obtain scene attributes corresponding to the single target strings; the data processing module 202 is further configured to construct the dictionary tree combining the AC automaton and the double-array dictionary tree according to the index information and the scene attributes of the single target strings.
The scene attributes in the embodiment of the present invention may be category information, part-of-speech information or other attribute information set for a string in use. Category information includes attributes such as food and person; part-of-speech information includes attributes such as prototype word and alternative word. By constructing the dictionary tree according to the index information and the scene attributes of the single target strings, not only can strings with AND, OR and NOT conditional attributes be queried, but strings with other attribute conditions can also be queried, thereby realizing a fast multi-pattern matching scheme that supports multiple conditions. Moreover, storing the conditions in the attributes of the strings effectively saves space and helps improve query efficiency.
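One way to picture scene attributes stored alongside each single target string is the following Python sketch. The Entry class, its field names and the select helper are all hypothetical; the text does not specify how the attributes are laid out, only that category and part-of-speech style information can accompany a string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """A single target string together with assumed scene attributes."""
    text: str
    category: str = ""      # e.g. "food", "person"
    word_type: str = ""     # e.g. "prototype", "alternative"


def select(entries, **wanted):
    """Keep the entries whose attributes equal every requested value."""
    return [e for e in entries
            if all(getattr(e, name) == value for name, value in wanted.items())]


entries = [
    Entry("apple", category="food", word_type="prototype"),
    Entry("Newton", category="person", word_type="prototype"),
    Entry("pomme", category="food", word_type="alternative"),
]
food = select(entries, category="food")
```

Here food contains the entries for "apple" and "pomme"; adding word_type="alternative" to the call would narrow the result to "pomme" alone.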
In an embodiment, the data processing module 202 is further configured to set a bitmap whose size is consistent with the number of all single strings in the multi-pattern matching dictionary tree; the data processing module 202 is further configured to mark the corresponding positions of the multiple matched strings on the bitmap as true; the data acquisition module 201 is further configured to obtain, according to the attributes of the multiple matched strings, all strings corresponding to each matched string; and the data processing module 202 is further configured to determine, among all the strings corresponding to each matched string, the strings whose corresponding positions on the bitmap are all marked true as the matching target strings.
In the embodiment of the present invention, after the multi-pattern matching dictionary tree is matched according to the query string and multiple matched strings are obtained, the index information is screened according to the matched strings for the strings whose conditional attributes satisfy those of the query string. Since the multi-pattern matching dictionary tree is itself constructed from the index information, the screening may also be performed on the multi-pattern matching dictionary tree according to the matched string information. Specifically, the screening in the embodiment of the present invention is conditional screening, that is, the matching target strings that correspond to the matched strings and whose conditional attributes are identical to those of the query string are screened out from the index information.
The conditional screening is specifically as follows. First, a bitmap is set whose size corresponds to the number of all single strings in the index information, that is, in the candidate target strings, or, as noted above, to the number of all single strings in the multi-pattern matching dictionary tree. Each bit on the bitmap corresponds to one of these single strings, and the bitmap is initialized accordingly. The positions corresponding to the matched strings obtained above are then marked as true on the bitmap, so that the conditional attributes of the matched strings are recorded in the bitmap. Specifically, when the query string is M ... N ... F, no matched string can be obtained, so the index task is ended directly. When the query string is A ... E ... F, the matched single strings are A and E, so the condition of the matched strings includes "not F". The positions corresponding to A and E on the bitmap are therefore set to true, while F is shown as false on the bitmap, which records the "not F" conditional attribute of the matched strings. All strings corresponding to each matched string are then obtained according to the attributes of the multiple matched strings. Finally, each of these strings is checked to determine whether all of its corresponding positions on the bitmap are true; the strings for which all corresponding positions are true are screened out and determined to be the matching target strings. The selected matching target strings thus all satisfy the "not F" conditional attribute, that is, the conditional attribute of each matching target string meets the requirement of being identical to that of the query string. The details of each matching result are then output; for example, one output is {"beginIndex": 0, "endIndex": 20, "word": "A&B~C~D", "label": "Politics"}. Once all matching results have been integrated, the current match is complete, the index task can be ended, and the next task can be started. In the embodiment of the present invention, the conditions are stored in the attributes of the strings, and a bitmap (BitSet) is used during the query to indicate whether an input text has hit a given string, which reduces space consumption and improves string query efficiency.
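The output record mentioned above can be assembled as in the following Python sketch. The text gives only one sample record and does not define beginIndex and endIndex; this sketch assumes they bound the span of the query text covered by the hits of the target's AND terms. The function name to_record and the hit offsets are hypothetical.

```python
import json


def to_record(word, label, hits):
    """Build one result record.

    hits: (start, end) offsets in the query text for the AND terms of `word`
    (assumed meaning; the sample output does not define the two indices).
    """
    return {
        "beginIndex": min(start for start, _ in hits),
        "endIndex": max(end for _, end in hits),
        "word": word,
        "label": label,
    }


# Illustrative offsets: A found at 0..1 and B found at 19..20 in the query text.
record = to_record("A&B~C~D", "Politics", [(0, 1), (19, 20)])
print(json.dumps(record))
# prints {"beginIndex": 0, "endIndex": 20, "word": "A&B~C~D", "label": "Politics"}
```

Integrating all matching results then amounts to collecting one such record per matching target string.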
Another aspect of the present invention provides a computer-readable storage medium in which computer-executable instructions are stored, the instructions, when executed, being used to perform any of the string matching methods described above.
It should be noted that the description of the above embodiments is similar to the description of the above method embodiments and has similar beneficial effects. For technical details not disclosed in these embodiments, please refer to the description of the method embodiments of the present invention; they are not repeated here for brevity.
In the embodiment of the present invention, the order in which multiple steps are carried out may be changed as long as the purpose is still achieved.
The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.

Claims (10)

1. A string matching method, characterized in that the method comprises:
obtaining a target string dictionary;
generating index information according to the target string dictionary, the index information including the string conditional attributes in the target string dictionary;
constructing a multi-pattern matching dictionary tree according to the index information.
2. The method according to claim 1, characterized in that generating index information according to the target string dictionary comprises:
sorting and de-duplicating the target strings in the target string dictionary according to their conditional attributes, to obtain de-duplicated single target strings and composite target strings, the conditional attributes including AND, OR and NOT;
establishing an inverted index according to the de-duplicated single target strings and composite target strings to obtain the index information, the index information including the conditional attribute of each single target string in its corresponding composite target strings.
3. The method according to claim 1 or 2, characterized in that the multi-pattern matching dictionary tree is a dictionary tree combining an AC automaton and a double-array dictionary tree.
4. The method according to claim 3, characterized in that constructing a multi-pattern matching dictionary tree according to the index information comprises:
obtaining scene attributes corresponding to the single target strings;
constructing the dictionary tree combining the AC automaton and the double-array dictionary tree according to the index information and the scene attributes of the single target strings.
5. The method according to claim 1, characterized in that the method further comprises:
obtaining a query string;
matching the query string against the multi-pattern matching dictionary tree to obtain multiple matched strings;
screening the index information according to the multiple matched strings to obtain matching target strings, the conditional attribute of each matching target string being identical to the conditional attribute of the query string.
6. The method according to claim 5, characterized in that screening the index information according to the multiple matched strings to obtain matching target strings comprises:
setting a bitmap whose size is consistent with the number of all single strings in the index information;
marking the corresponding positions of the multiple matched strings on the bitmap as true;
obtaining, according to the attributes of the multiple matched strings, all strings corresponding to each matched string;
determining, among all the strings corresponding to each matched string, the strings whose corresponding positions on the bitmap are all marked true as the matching target strings.
7. A string matching device, characterized by comprising:
a data acquisition module, configured to obtain a target string dictionary;
a data processing module, configured to generate index information according to the target string dictionary, the index information including the string conditional attributes in the target string dictionary;
the data processing module being further configured to construct a multi-pattern matching dictionary tree according to the index information.
8. The device according to claim 7, characterized in that:
the data processing module is further configured to sort and de-duplicate the target strings in the target string dictionary according to their conditional attributes, to obtain de-duplicated single target strings and composite target strings, the conditional attributes including AND, OR and NOT;
the data processing module is further configured to establish an inverted index according to the de-duplicated single target strings and composite target strings to obtain the index information, the index information including the conditional attribute of each single target string in its corresponding composite target strings.
9. The device according to claim 6 or 7, characterized in that the multi-pattern matching dictionary tree is a dictionary tree combining an AC automaton and a double-array dictionary tree.
10. A computer-readable storage medium, the computer storage medium storing computer-executable instructions which, when executed, are used to perform the string matching method according to any one of claims 1-6.
CN201910743568.5A 2019-08-13 2019-08-13 A kind of character string matching method, equipment and computer storage medium Pending CN110516118A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910743568.5A CN110516118A (en) 2019-08-13 2019-08-13 A kind of character string matching method, equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN110516118A true CN110516118A (en) 2019-11-29

Family

ID=68625585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910743568.5A Pending CN110516118A (en) 2019-08-13 2019-08-13 A kind of character string matching method, equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN110516118A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528647A (en) * 2016-10-15 2017-03-22 北京语联炉火信息技术有限公司 Term matching method based on a cedar double-array trie algorithm
CN106980656A (en) * 2017-03-10 2017-07-25 北京大学 A kind of searching method based on two-value code dictionary tree
CN109241360A (en) * 2018-08-21 2019-01-18 阿里巴巴集团控股有限公司 The matching process and device and electronic equipment of combining characters string
US10346485B1 (en) * 2014-08-08 2019-07-09 Google Llc Semi structured question answering system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
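Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所): "Computer Systems & Applications" (《计算机系统应用》), 30 June 2018 *
Liu Yu, Zhang Jinghui (刘宇,张敬会): "Address standardization algorithm based on AC automaton and address probability model" (《基于AC自动机和地址概率模型的地址标准化算法》) *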

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111475681A (en) * 2020-03-30 2020-07-31 湖北微源卓越科技有限公司 Multi-mode character string matching system and method
CN111475681B (en) * 2020-03-30 2023-05-16 湖北微源卓越科技有限公司 Multi-mode character string matching system and method
CN112800316A (en) * 2021-02-04 2021-05-14 北京易车互联信息技术有限公司 Search keyword extraction system based on double-array dictionary tree
CN113609249A (en) * 2021-09-09 2021-11-05 北京环境特性研究所 Target model simulation data storage method and device
CN113609249B (en) * 2021-09-09 2023-04-28 北京环境特性研究所 Target model simulation data storage method and device
CN115935961A (en) * 2022-10-27 2023-04-07 安芯网盾(北京)科技有限公司 Multi-mode matching high-performance algorithm and device for realizing multi-stage matching
CN116932838A (en) * 2023-09-13 2023-10-24 浙江寰福科技有限公司 Database-based data query, update and storage method and device
CN116932838B (en) * 2023-09-13 2023-11-24 浙江寰福科技有限公司 Database-based data query, update and storage method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211201

Address after: 210000 8th floor, building D11, Hongfeng science and Technology Park, Nanjing Economic and Technological Development Zone, Jiangsu Province

Applicant after: New Technology Co.,Ltd.

Applicant after: VOLKSWAGEN (CHINA) INVESTMENT Co.,Ltd.

Address before: 430223 floor 30, building a, block K18, poly times, No. 332, Guanshan Avenue, Donghu New Technology Development Zone, Wuhan City, Hubei Province

Applicant before: Go out and ask (Wuhan) Information Technology Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20191129