A method and system for quickly retrieving similar character strings
Technical field
This application relates to the technical field of information processing, and in particular to a method and system for quickly retrieving similar character strings.
Background art
In the field of information processing, it is often necessary to query a massive collection of text entries for character strings similar to a target string. The existing algorithm calculates the edit distance between the target string and every string in the massive collection of text entries, and classifies all strings whose edit distance is less than some threshold as similar strings.
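For reference, the prior-art approach described above can be sketched as follows. This is only an illustrative baseline: the function names `edit_distance` and `brute_force_similar` are hypothetical, and Levenshtein distance is assumed as the edit distance.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def brute_force_similar(target: str, entries: List[str], threshold: int) -> List[str]:
    """Compare the target with every entry; keep those below the threshold."""
    return [e for e in entries if edit_distance(target, e) < threshold]
```

Each query costs one edit-distance computation per entry, and each computation is proportional to the product of the two string lengths, which is the cost the application sets out to avoid.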
This prior-art method has high time complexity, and with hundreds of thousands of text entries its performance often cannot meet commercial requirements. Besides the number of text entries to be compared, the time complexity of the existing algorithm also depends on the average string length of all the text entries, so at present it cannot be applied in scenarios with large data volumes.
Summary of the invention
The purpose of the embodiments of the application is to provide a method and system for quickly retrieving similar character strings, which can greatly increase the speed of string retrieval.
To achieve the above object, in one aspect the application provides a method for quickly retrieving similar character strings, the method comprising:
reading a preset number of existing text entries and, for each text entry, splitting the text entry into several phrases and assigning a corresponding weight value to each phrase;
performing a hash operation on the split phrases based on the assigned weight values, to obtain a first hash string corresponding to the text entry;
performing collapse processing on the first hash string, to obtain a second hash string whose length meets a specified condition;
establishing a prefix tree on the second hash strings, and retrieving, based on the prefix tree, strings similar to a target string from the preset number of existing text entries.
In this embodiment, assigning a corresponding weight value to each phrase comprises:
assigning a corresponding weight value to the current phrase according to the relevance between the current phrase and the text entry, wherein the higher the relevance, the larger the corresponding weight value.
In this embodiment, performing a hash operation on the split phrases comprises:
processing the split phrases and their corresponding weight values using the SimHash algorithm, to obtain the first hash string corresponding to the text entry.
In this embodiment, performing collapse processing on the first hash string comprises:
splitting the first hash string into multiple substrings at a fixed interval, and assigning the same weight value to each substring obtained by the split;
processing the split substrings and their corresponding weight values using the SimHash algorithm, to obtain a third hash string corresponding to the first hash string;
if necessary, cutting the third hash string so that the length of the second hash string obtained after cutting is less than the length of the first hash string, while the correspondence between the second hash string and the first hash string is preserved.
In this embodiment, retrieving strings similar to the target string comprises:
splitting the target string into several phrases, and assigning a corresponding weight value to each phrase;
processing the split phrases and their corresponding weight values using the SimHash algorithm, to obtain a fourth hash string corresponding to the target string;
performing collapse processing on the fourth hash string, to obtain a fifth hash string whose length is less than that of the fourth hash string;
retrieving the fifth hash string in the prefix tree, to obtain a first result set;
establishing a new prefix tree on the first result set, and retrieving the fourth hash string in the new prefix tree, to obtain a second result set;
taking the second result set as the set of strings similar to the fourth hash string.
In this embodiment, retrieving, based on the prefix tree, strings similar to the target string from the preset number of existing text entries comprises:
S51: searching downward level by level starting from the top node of the prefix tree, and calculating the edit distance between the current node and the target string;
S52: when the edit distance is less than a specified threshold, repeating step S51 to complete the search of the child nodes;
S53: when the edit distance reaches the specified threshold, stopping the search of the current node and of its child nodes, and continuing the search from the next sibling node at the same level as the current node;
S54: if the current node has no child nodes, regarding the hash string corresponding to that node as similar to the hash string of the target string, stopping the retrieval of the current node, and continuing the search from the next sibling node at the same level as the current node;
S55: when all nodes in the prefix tree have been traversed or the search process has stopped, ending the retrieval of similar strings.
To achieve the above object, the application also provides a system for quickly retrieving similar character strings, the system comprising:
a text entry processing unit, configured to read a preset number of existing text entries and, for each text entry, split the text entry into several phrases and assign a corresponding weight value to each phrase;
a first hash string determination unit, configured to perform a hash operation on the split phrases based on the assigned weight values, to obtain a first hash string corresponding to the text entry;
a collapse processing unit, configured to perform collapse processing on the first hash string, to obtain a second hash string whose length meets a specified condition;
a retrieval unit, configured to establish a prefix tree on the second hash strings, and to retrieve, based on the prefix tree, strings similar to a target string from the preset number of existing text entries.
In this embodiment, the collapse processing unit comprises:
a splitting module, configured to split the first hash string into multiple substrings at a fixed interval, and to assign the same weight value to each substring obtained by the split;
a SimHash module, configured to process the split substrings and their corresponding weight values using the SimHash algorithm, to obtain a third hash string corresponding to the first hash string;
a cutting module, configured to cut the third hash string so that the length of the second hash string obtained after cutting is less than the length of the first hash string, while the correspondence between the second hash string and the first hash string is preserved.
In this embodiment, the retrieval unit comprises:
a target string processing module, configured to split the target string into several phrases and assign a corresponding weight value to each phrase;
a fourth hash string determination module, configured to process the split phrases and their corresponding weight values using the SimHash algorithm, to obtain a fourth hash string corresponding to the target string;
a collapse processing module, configured to perform collapse processing on the fourth hash string, to obtain a fifth hash string whose length is less than that of the fourth hash string;
an intermediate retrieval module, configured to retrieve the fifth hash string in the prefix tree, to obtain a first result set;
a re-retrieval module, configured to establish a new prefix tree on the first result set and to retrieve the fourth hash string in the new prefix tree, to obtain a second result set;
a result determination module, configured to take the second result set as the set of strings similar to the fourth hash string.
In this embodiment, the retrieval unit comprises:
an edit distance calculation module, configured to search downward level by level starting from the top node of the prefix tree, and to calculate the edit distance between the current node and the hash string of the target string;
a first determination module, configured to repeat the processing of the edit distance calculation module when the edit distance is less than a specified threshold, to complete the search of the child nodes;
a second determination module, configured to stop the search of the current node and of its child nodes when the edit distance reaches the specified threshold, and to continue the search from the next sibling node at the same level as the current node;
a third determination module, configured to, if the current node has no child nodes, regard the hash string corresponding to that node as similar to the hash string of the target string, stop the retrieval of the current node, and continue the search from the next sibling node at the same level as the current node;
a retrieval ending module, configured to end the retrieval of similar strings when all nodes in the prefix tree have been traversed or the search process has stopped.
The present invention converts an extremely complex and computationally heavy text similarity matching process into lookups on several prefix trees that are either related to one another or dynamically generated. By controlling the similarity threshold, roughly identical similar texts within a certain range can be matched. Compared with calculating the edit distance between strings one by one, the time complexity of this algorithm is several orders of magnitude lower, which greatly improves retrieval efficiency.
With reference to the following description and drawings, specific embodiments of the application are disclosed in detail, indicating the ways in which the principles of the application may be employed. It should be understood that the embodiments of the application are not thereby limited in scope. Within the spirit and terms of the appended claims, the embodiments of the application include many changes, modifications and equivalents.
Features described and/or illustrated for one embodiment may be used in the same or a similar way in one or more other embodiments, combined with features in other embodiments, or substituted for features in other embodiments.
It should be emphasized that the term "comprises/comprising", when used herein, refers to the presence of a feature, integer, step or component, but does not exclude the presence or addition of one or more other features, integers, steps or components.
Brief description of the drawings
The accompanying drawings are included to provide a further understanding of the embodiments of the application and constitute a part of the specification; they illustrate the embodiments of the application and, together with the written description, explain the principles of the application. Obviously, the drawings described below show only some embodiments of the application, and those of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:
Fig. 1 is a flowchart of establishing the prefix tree in an embodiment of the application;
Fig. 2 is a flowchart of retrieving similar strings in an embodiment of the application;
Fig. 3 is a flowchart of retrieving similar strings in another embodiment of the application;
Fig. 4 is a functional block diagram of the system for quickly retrieving similar character strings in an embodiment of the application.
Detailed description of the embodiments
To enable those skilled in the art to better understand the technical solutions of the application, the technical solutions in the embodiments of the application are described clearly and completely below with reference to the drawings of those embodiments. Obviously, the described embodiments are only some of the embodiments of the application, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the application, without creative effort, shall fall within the scope of protection of the application.
Referring to Fig. 1, the application provides a method for quickly retrieving similar character strings, the method comprising:
S1: reading a preset number of existing text entries and, for each text entry, splitting the text entry into several phrases and assigning a corresponding weight value to each phrase;
S2: performing a hash operation on the split phrases based on the assigned weight values, to obtain a first hash string corresponding to the text entry;
S3: performing collapse processing on the first hash string, to obtain a second hash string whose length meets a specified condition;
S4: establishing a prefix tree on the second hash strings, and retrieving, based on the prefix tree, strings similar to a target string from the preset number of existing text entries.
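The prefix tree of step S4 can be pictured with the minimal sketch below. It assumes that hash strings are plain strings of "0" and "1" characters and that each tree node stores the indices of the text entries whose hash string ends there; the names `TrieNode`, `build_prefix_tree` and `exact_lookup` are hypothetical and do not appear in the application.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrieNode:
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    # Indices of the text entries whose hash string ends at this node.
    entry_ids: List[int] = field(default_factory=list)


def build_prefix_tree(hash_strings: List[str]) -> TrieNode:
    """Insert every hash string character by character, sharing common prefixes."""
    root = TrieNode()
    for entry_id, h in enumerate(hash_strings):
        node = root
        for ch in h:
            node = node.children.setdefault(ch, TrieNode())
        node.entry_ids.append(entry_id)
    return root


def exact_lookup(root: TrieNode, h: str) -> List[int]:
    """Follow one path from the root; return the entries stored at its end."""
    node = root
    for ch in h:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return node.entry_ids
```

Because hash strings that share a prefix share a path, a decision made at one node applies to every entry below it, which is what makes pruning during a similarity search possible.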
In this embodiment, assigning a corresponding weight value to each phrase comprises:
assigning a corresponding weight value to the current phrase according to the relevance between the current phrase and the text entry, wherein the higher the relevance, the larger the corresponding weight value.
In this embodiment, the relevance between the current phrase and the text entry can be determined by calculating the spatial distance between vectors. Specifically, the current phrase and the text entry can each be converted into a word vector; the spatial distance between the two vectors then determines the relevance between them, and the smaller the distance, the higher the relevance.
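One possible way to turn vector distance into weights is sketched below. The application does not specify the distance measure or the mapping from distance to weight, so Euclidean distance and the formula 1 / (1 + distance) are assumptions made purely for illustration, as is the function name `assign_weights`. The sketch uses `math.dist`, which requires Python 3.8 or later.

```python
import math
from typing import Dict, Sequence


def assign_weights(
    phrase_vectors: Dict[str, Sequence[float]],
    entry_vector: Sequence[float],
) -> Dict[str, float]:
    """Give each phrase a weight that grows as its vector gets closer to the entry vector."""
    weights = {}
    for phrase, vec in phrase_vectors.items():
        distance = math.dist(vec, entry_vector)  # Euclidean distance (assumed measure)
        weights[phrase] = 1.0 / (1.0 + distance)  # distance 0 gives the maximum weight 1.0
    return weights
```

Any monotonically decreasing function of the distance would satisfy the stated requirement that a closer phrase receives a larger weight.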
In this embodiment, performing a hash operation on the split phrases comprises:
processing the split phrases and their corresponding weight values using the SimHash algorithm, to obtain the first hash string corresponding to the text entry.
In this embodiment, performing collapse processing on the first hash string comprises:
splitting the first hash string into multiple substrings at a fixed interval, and assigning the same weight value to each substring obtained by the split;
processing the split substrings and their corresponding weight values using the SimHash algorithm, to obtain a third hash string corresponding to the first hash string;
cutting the third hash string so that the length of the second hash string obtained after cutting is less than the length of the first hash string, while the correspondence between the second hash string and the first hash string is preserved.
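The collapse processing can be illustrated with the sketch below. The interval of 8 characters, the output length of 16, the use of MD5 inside the second SimHash pass, and keeping the leading characters when cutting are all assumptions; the application only requires a fixed interval, equal weights, and a shorter result. The names `simhash_bits`, `collapse` and `build_collapse_index` are hypothetical.

```python
import hashlib
from typing import Dict, List


def simhash_bits(tokens: List[str], weights: List[float], bits: int = 64) -> str:
    """Weighted SimHash over arbitrary string tokens, returned as a '0'/'1' string."""
    totals = [0.0] * bits
    for token, weight in zip(tokens, weights):
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest(), "big")
        for i in range(bits):
            totals[i] += weight if (h >> i) & 1 else -weight
    return "".join("1" if t > 0 else "0" for t in totals)


def collapse(first_hash: str, interval: int = 8, out_len: int = 16) -> str:
    """Split at a fixed interval, SimHash the pieces with equal weight, then cut."""
    if not 0 < out_len < len(first_hash) or out_len > 64:
        raise ValueError("out_len must be shorter than the first hash and at most 64")
    segments = [first_hash[i:i + interval] for i in range(0, len(first_hash), interval)]
    third_hash = simhash_bits(segments, [1.0] * len(segments))  # every segment weighs 1
    return third_hash[:out_len]  # cutting step: keep only the leading characters


def build_collapse_index(first_hashes: List[str]) -> Dict[str, List[str]]:
    """Keep the correspondence: second hash string -> first hash strings it came from."""
    index: Dict[str, List[str]] = {}
    for h1 in first_hashes:
        index.setdefault(collapse(h1), []).append(h1)
    return index
```

The dictionary returned by `build_collapse_index` is one simple way to keep the correspondence between the second and first hash strings, so that a match on the short string can be traced back to the full one.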
Referring to Fig. 2, in this embodiment, retrieving strings similar to the target string comprises:
splitting the target string into several phrases, and assigning a corresponding weight value to each phrase;
processing the split phrases and their corresponding weight values using the SimHash algorithm, to obtain a fourth hash string corresponding to the target string;
performing collapse processing on the fourth hash string, to obtain a fifth hash string whose length is less than that of the fourth hash string;
retrieving the fifth hash string in the prefix tree, to obtain a first result set;
establishing a new prefix tree on the first result set, and retrieving the fourth hash string in the new prefix tree, to obtain a second result set;
taking the second result set as the set of strings similar to the fourth hash string.
Specifically, in one application scenario, similar strings can be retrieved according to the following steps:
The existing massive set of strings to be processed is segmented using a word segmentation algorithm, and features are extracted from the segmented text;
Different weights are assigned to different features, and a locality-sensitive hash operation is performed on them using the SimHash algorithm to obtain hash string H1;
H1 is cut into several character segments H2 at a fixed character interval, the same weight is set for each segment (usually 1), and a SimHash operation is then performed again on the segments H2 to obtain hash string H3;
H3 is cut so that its length is less than that of H1; this process is the collapse of the hash and is referred to as the collapse algorithm;
Prefix tree T1 is established on H3, while ensuring that the correspondence between H1 and H3 is preserved;
For an input string H4 to be retrieved, its two hash values H5 and H6 are calculated in the manner described above (H5 in the same way as H1, and H6 in the same way as H3);
A fast similarity lookup on tree T1 is completed using H6, to obtain a set S1;
Prefix tree T2 is established on S1, and a fast similarity lookup on T2 is completed using H5, to finally obtain a set S2;
Set S2 can be regarded as the set of strings similar to H4.
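The coarse-to-fine structure of these steps can be sketched as below. To keep the example short, a linear scan with Hamming distance stands in for the two prefix-tree lookups on T1 and T2; that substitution, the two separate thresholds, and the names `hamming` and `two_stage_search` are assumptions for illustration only.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length hash strings differ."""
    if len(a) != len(b):
        raise ValueError("hash strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def two_stage_search(
    index: List[Tuple[str, str, str]],  # rows of (entry text, H1, H3)
    h5: str,                            # full hash of the query string H4
    h6: str,                            # collapsed hash of the query string H4
    coarse_threshold: int,
    fine_threshold: int,
) -> List[str]:
    # Stage 1: filter on the short collapsed hashes (stands in for the lookup on T1).
    s1 = [row for row in index if hamming(row[2], h6) < coarse_threshold]
    # Stage 2: re-check the survivors on the full hashes (stands in for the lookup on T2).
    s2 = [row for row in s1 if hamming(row[1], h5) < fine_threshold]
    return [row[0] for row in s2]
```

The first stage works on short strings and discards most entries cheaply; the second stage applies the more expensive comparison only to the much smaller set S1.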
Referring to Fig. 3, in one embodiment of the application, retrieving, based on the prefix tree, strings similar to the target string from the preset number of existing text entries comprises:
S51: searching downward level by level starting from the top node of the prefix tree, and calculating the edit distance between the current node and the target string;
S52: when the edit distance is less than a specified threshold, repeating step S51 to complete the search of the child nodes;
S53: when the edit distance reaches the specified threshold, stopping the search of the current node and of its child nodes, and continuing the search from the next sibling node at the same level as the current node;
S54: if the current node has no child nodes, regarding the hash string corresponding to that node as similar to the hash string of the target string, stopping the retrieval of the current node, and continuing the search from the next sibling node at the same level as the current node;
S55: when all nodes in the prefix tree have been traversed or the search process has stopped, ending the retrieval of similar strings.
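Steps S51 to S55 resemble the well-known technique of walking a prefix tree while carrying one row of the Levenshtein table per node, sketched below. Two points are interpretations rather than statements of the application: the pruning test uses the smallest value in the row (a lower bound on the distance of every string below the node), and a terminal node is accepted only if its full distance is below the threshold. The names `Node`, `build_trie` and `search_similar` are hypothetical.

```python
from typing import Dict, List


class Node:
    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.terminal = False  # True if a stored hash string ends here


def build_trie(words: List[str]) -> Node:
    root = Node()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, Node())
        node.terminal = True
    return root


def search_similar(root: Node, target: str, threshold: int) -> List[str]:
    """Return stored strings whose edit distance to `target` is below `threshold`."""
    results: List[str] = []

    def visit(node: Node, prefix: str, row: List[int]) -> None:
        # row[j] = edit distance between `prefix` and target[:j]
        if node.terminal and row[-1] < threshold:  # S54: end of a stored string
            results.append(prefix)
        for ch, child in node.children.items():    # siblings are tried one after another
            new_row = [row[0] + 1]
            for j in range(1, len(target) + 1):
                new_row.append(min(
                    new_row[j - 1] + 1,                  # insertion
                    row[j] + 1,                          # deletion
                    row[j - 1] + (target[j - 1] != ch),  # substitution
                ))
            if min(new_row) < threshold:           # S52: still promising, descend
                visit(child, prefix + ch, new_row)
            # S53: otherwise skip this child and its whole subtree

    visit(root, "", list(range(len(target) + 1)))  # S51: start from the top node
    return results                                 # S55: traversal finished
```

For example, with the stored strings "0101", "0100", "1111" and "0111", the target "0101" and a threshold of 2, the subtree under "1" is abandoned before "1111" is fully read, while the other three strings are returned.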
Referring to Fig. 4, the application also provides a system for quickly retrieving similar character strings, the system comprising:
a text entry processing unit 100, configured to read a preset number of existing text entries and, for each text entry, split the text entry into several phrases and assign a corresponding weight value to each phrase;
a first hash string determination unit 200, configured to perform a hash operation on the split phrases based on the assigned weight values, to obtain a first hash string corresponding to the text entry;
a collapse processing unit 300, configured to perform collapse processing on the first hash string, to obtain a second hash string whose length meets a specified condition;
a retrieval unit 400, configured to establish a prefix tree on the second hash strings, and to retrieve, based on the prefix tree, strings similar to a target string from the preset number of existing text entries.
In this embodiment, the collapse processing unit comprises:
a splitting module, configured to split the first hash string into multiple substrings at a fixed interval, and to assign the same weight value to each substring obtained by the split;
a SimHash module, configured to process the split substrings and their corresponding weight values using the SimHash algorithm, to obtain a third hash string corresponding to the first hash string;
a cutting module, configured to cut the third hash string so that the length of the second hash string obtained after cutting is less than the length of the first hash string, while the correspondence between the second hash string and the first hash string is preserved.
In this embodiment, the retrieval unit comprises:
a target string processing module, configured to split the target string into several phrases and assign a corresponding weight value to each phrase;
a fourth hash string determination module, configured to process the split phrases and their corresponding weight values using the SimHash algorithm, to obtain a fourth hash string corresponding to the target string;
a collapse processing module, configured to perform collapse processing on the fourth hash string, to obtain a fifth hash string whose length is less than that of the fourth hash string;
an intermediate retrieval module, configured to retrieve the fifth hash string in the prefix tree, to obtain a first result set;
a re-retrieval module, configured to establish a new prefix tree on the first result set and to retrieve the fourth hash string in the new prefix tree, to obtain a second result set;
a result determination module, configured to take the second result set as the set of strings similar to the fourth hash string.
In this embodiment, the retrieval unit comprises:
an edit distance calculation module, configured to search downward level by level starting from the top node of the prefix tree, and to calculate the edit distance between the current node and the hash string of the target string;
a first determination module, configured to repeat the processing of the edit distance calculation module when the edit distance is less than a specified threshold, to complete the search of the child nodes;
a second determination module, configured to stop the search of the current node and of its child nodes when the edit distance reaches the specified threshold, and to continue the search from the next sibling node at the same level as the current node;
a third determination module, configured to, if the current node has no child nodes, regard the hash string corresponding to that node as similar to the hash string of the target string, stop the retrieval of the current node, and continue the search from the next sibling node at the same level as the current node;
a retrieval ending module, configured to end the retrieval of similar strings when all nodes in the prefix tree have been traversed or the search process has stopped.
The present invention converts an extremely complex and computationally heavy text similarity matching process into lookups on several prefix trees that are either related to one another or dynamically generated. By controlling the similarity threshold, roughly identical similar texts within a certain range can be matched. Compared with calculating the edit distance between strings one by one, the time complexity of this algorithm is several orders of magnitude lower, which greatly improves retrieval efficiency.
The above description of the various embodiments of the application is provided to those skilled in the art for the purpose of description. It is not intended to be exhaustive or to limit the invention to a single disclosed embodiment. As described above, various alternatives and variations of the application will be apparent to those of ordinary skill in the art to which the above technology belongs. Therefore, although some alternative embodiments have been discussed specifically, other embodiments will be apparent to, or relatively easily derived by, those skilled in the art. The application is intended to cover all alternatives, modifications and variations of the invention discussed herein, as well as other embodiments that fall within the spirit and scope of the above application.