CN109241124A

CN109241124A - Method and system for fast retrieval of similar character strings

Info

Publication number: CN109241124A
Application number: CN201710558849.4A
Authority: CN
Inventors: 李光曦
Original assignee: Shanghai Education Technology (shanghai) Ltd By Share Ltd
Current assignee: Shanghai Xinhu Education Technology Co ltd
Priority date: 2017-07-11
Filing date: 2017-07-11
Publication date: 2019-01-18
Anticipated expiration: 2037-07-11
Also published as: CN109241124B

Abstract

The application provides a method and system for fast retrieval of similar character strings. The method comprises: reading a preset number of existing text entries and, for each text entry, splitting the text entry into several phrases and assigning a corresponding weight value to each phrase; performing a hash operation on the split phrases based on the assigned weight values to obtain a first hash string corresponding to the text entry; performing collapse processing on the first hash string to obtain a second hash string whose length satisfies a specified condition; and building a prefix tree over the second hash strings and, based on the prefix tree, retrieving strings similar to a target string from the preset number of existing text entries. The technical solution provided by the application can greatly improve the speed of string retrieval.

Description

Method and system for fast retrieval of similar character strings

Technical field

This application relates to the technical field of information processing, and in particular to a method and system for fast retrieval of similar character strings.

Background art

In the field of information processing, it is often necessary to find, within a massive set of text entries, the strings that are similar to a target string. The existing algorithm computes the edit distance between the target string and every string in the text entries, and classifies all strings whose edit distance is below some threshold as similar strings.

This prior-art method has high time complexity; with hundreds of thousands of text entries its performance often cannot meet commercial requirements. Besides the number of text entries to be compared, the time complexity of the existing algorithm also depends on the average string length of all text entries, so at present it cannot be applied in scenarios with large data volumes.
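The prior-art baseline described above can be sketched as follows. This is a minimal illustration, not code from the application: the function names `edit_distance` and `brute_force_similar` are hypothetical, and the edit distance is assumed to be the Levenshtein distance. Every entry is compared against the target, so the cost grows with both the number of entries and their length.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def brute_force_similar(target, entries, threshold):
    """Compare the target with every entry; keep those below the threshold."""
    return [e for e in entries if edit_distance(target, e) < threshold]
```

For N entries of average length L and a target of similar length, this performs on the order of N * L * L character comparisons, which is the cost the application sets out to avoid.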

Summary of the invention

The purpose of the embodiments of the application is to provide a method and system for fast retrieval of similar character strings that can greatly improve the speed of string retrieval.

To achieve the above object, in one aspect the application provides a method for fast retrieval of similar character strings, the method comprising:

reading a preset number of existing text entries and, for each text entry, splitting the text entry into several phrases and assigning a corresponding weight value to each phrase;

performing a hash operation on the split phrases based on the assigned weight values, to obtain a first hash string corresponding to the text entry;

performing collapse processing on the first hash string, to obtain a second hash string whose length satisfies a specified condition;

building a prefix tree over the second hash strings and, based on the prefix tree, retrieving strings similar to a target string from the preset number of existing text entries.

In this embodiment, assigning a corresponding weight value to each phrase includes:

assigning a weight value to the current phrase according to the relevance between the current phrase and the text entry, wherein the higher the relevance, the larger the corresponding weight value.

In this embodiment, performing a hash operation on the split phrases includes:

processing the split phrases and their corresponding weight values with the SimHash algorithm, to obtain the first hash string corresponding to the text entry.
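A weighted SimHash of the kind referred to above can be sketched as follows. The sketch makes several assumptions that the application does not specify: the hash string is represented as a string of '0' and '1' characters, MD5 is used as the per-phrase hash, and the output length is 64 bits. The function name `simhash` is illustrative.

```python
import hashlib


def simhash(weighted_phrases, bits=64):
    """Weighted SimHash over (phrase, weight) pairs.

    Returns a string of '0'/'1' characters of length `bits` (at most 128,
    because MD5 yields 128 bits).
    """
    totals = [0.0] * bits
    for phrase, weight in weighted_phrases:
        digest = hashlib.md5(phrase.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            # A set bit votes +weight for position i, a clear bit votes -weight.
            if (h >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    # The sign of each accumulated vote becomes one output bit.
    return "".join("1" if t > 0 else "0" for t in totals)
```

Because each bit is decided by a weighted vote, a phrase with a large weight dominates the result, and two entries that share most of their high-weight phrases tend to produce hash strings that differ in only a few positions.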

In this embodiment, performing collapse processing on the first hash string includes:

splitting the first hash string into multiple substrings at fixed intervals, and assigning the same weight value to each substring obtained by the split;

processing the split substrings and their corresponding weight values with the SimHash algorithm, to obtain a third hash string corresponding to the first hash string;

if necessary, truncating the third hash string, so that the second hash string obtained after truncation is shorter than the first hash string, while the correspondence between the second hash string and the first hash string is preserved.
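The collapse processing above can be sketched as follows, under stated assumptions: hash strings are strings of '0'/'1' characters, the interval is 8 characters, the truncated length is 16, and MD5 is the per-segment hash. None of these values come from the application, and the names `collapse` and `index_hashes` are hypothetical. The dictionary in `index_hashes` is one simple way to keep the correspondence between the second and first hash strings.

```python
import hashlib


def simhash(weighted_items, bits):
    """Weighted SimHash returning a '0'/'1' string of length `bits` (<= 128)."""
    totals = [0.0] * bits
    for item, weight in weighted_items:
        h = int.from_bytes(hashlib.md5(item.encode("utf-8")).digest(), "big")
        for i in range(bits):
            totals[i] += weight if (h >> i) & 1 else -weight
    return "".join("1" if t > 0 else "0" for t in totals)


def collapse(h1, interval=8, out_len=16):
    """Split h1 at fixed intervals, SimHash the pieces with weight 1, truncate."""
    if not 0 < out_len < len(h1) <= 128:
        raise ValueError("out_len must be shorter than h1, and h1 at most 128 long")
    segments = [h1[i:i + interval] for i in range(0, len(h1), interval)]
    h3 = simhash([(seg, 1) for seg in segments], bits=len(h1))
    return h3[:out_len]  # truncated (second) hash string


def index_hashes(first_hashes):
    """Map each collapsed hash back to the first hash strings it came from."""
    index = {}
    for h1 in first_hashes:
        index.setdefault(collapse(h1), []).append(h1)
    return index
```

Several first hash strings may collapse to the same second hash string, which is why the index stores a list per key rather than a single value.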

In this embodiment, retrieving strings similar to the target string includes:

splitting the target string into several phrases, and assigning a corresponding weight value to each phrase;

processing the split phrases and their corresponding weight values with the SimHash algorithm, to obtain a fourth hash string corresponding to the target string;

performing collapse processing on the fourth hash string, to obtain a fifth hash string that is shorter than the fourth hash string;

searching the prefix tree for the fifth hash string, to obtain a first result set;

building a new prefix tree over the first result set, and searching the new prefix tree for the fourth hash string, to obtain a second result set;

taking the second result set as the set of strings similar to the fourth hash string.

In this embodiment, retrieving strings similar to the target string from the preset number of existing text entries based on the prefix tree includes:

S51: searching downward layer by layer, starting from the root node of the prefix tree, and calculating the edit distance between the current node and the target string;

S52: when the edit distance is less than a specified threshold, repeating step S51 to complete the search of the child nodes;

S53: when the edit distance reaches the specified threshold, stopping the search of the current node and of its child nodes, and continuing the layer-by-layer search from the next sibling node at the same level as the current node;

S54: if the current node has no child nodes, regarding the hash string corresponding to the node as similar to the hash string of the target string, stopping the retrieval of the current node, and continuing the layer-by-layer search from the next sibling node at the same level as the current node;

S55: when all nodes in the prefix tree have been traversed, or the search process has stopped, ending the retrieval of similar strings.
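Steps S51 to S55 can be sketched as a depth-first walk over a prefix tree that prunes subtrees by edit distance. The sketch below is one possible reading, not the applicant's implementation: it uses the standard row-by-row Levenshtein computation over a trie, prunes a node when no extension of its prefix can come in under the threshold, and adds a final check at each stored string. The names `TrieNode`, `build_trie` and `search_similar` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.is_end = False  # True if a stored string ends at this node


def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root


def search_similar(root, target, threshold):
    """Return stored strings whose edit distance to `target` is < threshold."""
    results = []

    def visit(node, prefix, prev_row):
        for ch, child in node.children.items():      # siblings, one at a time
            # Extend the Levenshtein table by one row for the new character.
            row = [prev_row[0] + 1]
            for j in range(1, len(target) + 1):
                cost = 0 if target[j - 1] == ch else 1
                row.append(min(row[j - 1] + 1,
                               prev_row[j] + 1,
                               prev_row[j - 1] + cost))
            if min(row) >= threshold:
                continue                              # S53: skip this subtree
            if child.is_end and row[-1] < threshold:
                results.append(prefix + ch)           # S54: similar string found
            visit(child, prefix + ch, row)            # S52: descend to children

    visit(root, "", list(range(len(target) + 1)))     # S51: start at the root
    return sorted(results)                            # S55: traversal finished
```

Strings that share a prefix share the rows computed for that prefix, and a pruned node removes its whole subtree from consideration; this is where the saving over one-by-one comparison comes from.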

To achieve the above object, the application also provides a system for fast retrieval of similar character strings, the system comprising:

a text entry processing unit, configured to read a preset number of existing text entries and, for each text entry, split the text entry into several phrases and assign a corresponding weight value to each phrase;

a first hash string determination unit, configured to perform a hash operation on the split phrases based on the assigned weight values, to obtain a first hash string corresponding to the text entry;

a collapse processing unit, configured to perform collapse processing on the first hash string, to obtain a second hash string whose length satisfies a specified condition;

a retrieval unit, configured to build a prefix tree over the second hash strings and, based on the prefix tree, retrieve strings similar to a target string from the preset number of existing text entries.

In this embodiment, the collapse processing unit includes:

a splitting module, configured to split the first hash string into multiple substrings at fixed intervals, and to assign the same weight value to each substring obtained by the split;

a SimHash module, configured to process the split substrings and their corresponding weight values with the SimHash algorithm, to obtain a third hash string corresponding to the first hash string;

a truncation module, configured to truncate the third hash string, so that the second hash string obtained after truncation is shorter than the first hash string, while the correspondence between the second hash string and the first hash string is preserved.

In this embodiment, the retrieval unit includes:

a target string processing module, configured to split the target string into several phrases, and to assign a corresponding weight value to each phrase;

a fourth hash string determination module, configured to process the split phrases and their corresponding weight values with the SimHash algorithm, to obtain a fourth hash string corresponding to the target string;

a collapse processing module, configured to perform collapse processing on the fourth hash string, to obtain a fifth hash string that is shorter than the fourth hash string;

an intermediate retrieval module, configured to search the prefix tree for the fifth hash string, to obtain a first result set;

a re-retrieval module, configured to build a new prefix tree over the first result set, and to search the new prefix tree for the fourth hash string, to obtain a second result set;

a result determination module, configured to take the second result set as the set of strings similar to the fourth hash string.

In this embodiment, the retrieval unit includes:

an edit distance calculation module, configured to search downward layer by layer, starting from the root node of the prefix tree, and to calculate the edit distance between the current node and the hash string of the target string;

a first determination module, configured to repeat the processing of the edit distance calculation module when the edit distance is less than a specified threshold, to complete the search of the child nodes;

a second determination module, configured to stop the search of the current node and of its child nodes when the edit distance reaches the specified threshold, and to continue the layer-by-layer search from the next sibling node at the same level as the current node;

a third determination module, configured to regard the hash string corresponding to the current node as similar to the hash string of the target string if the current node has no child nodes, to stop the retrieval of the current node, and to continue the layer-by-layer search from the next sibling node at the same level as the current node;

a retrieval ending module, configured to end the retrieval of similar strings when all nodes in the prefix tree have been traversed or the search process has stopped.

The present invention converts the extremely complex and computationally heavy process of text similarity matching into lookups over several prefix trees that are either associated with one another or generated dynamically. By controlling the similarity threshold, texts that are roughly the same within a certain range can be matched. Compared with calculating the edit distance between strings one by one, the time complexity of the algorithm is several orders of magnitude smaller, which greatly improves retrieval efficiency.

With reference to the following description and drawings, specific embodiments of the application are disclosed in detail, indicating the ways in which the principles of the application may be employed. It should be understood that the embodiments of the application are not thereby limited in scope. Within the spirit and terms of the appended claims, the embodiments of the application include many changes, modifications and equivalents.

Features described and/or illustrated for one embodiment may be used in the same or a similar way in one or more other embodiments, combined with features of other embodiments, or substituted for features of other embodiments.

It should be emphasized that the term "comprises/comprising", when used herein, refers to the presence of a feature, integer, step or component, but does not exclude the presence or addition of one or more other features, integers, steps or components.

Brief description of the drawings

The accompanying drawings are included to provide a further understanding of the embodiments of the application. They form part of the specification, illustrate the embodiments of the application, and together with the written description explain the principles of the application. Obviously, the drawings described below show only some embodiments of the application; a person of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:

Fig. 1 is a flowchart of building the prefix tree in an embodiment of the application;

Fig. 2 is a flowchart of retrieving similar strings in an embodiment of the application;

Fig. 3 is a flowchart of retrieving similar strings in another embodiment of the application;

Fig. 4 is a functional block diagram of the system for fast retrieval of similar character strings in an embodiment of the application.

Detailed description of the embodiments

To enable those skilled in the art to better understand the technical solutions of the application, the technical solutions in the embodiments of the application are described clearly and completely below with reference to the drawings of those embodiments. Obviously, the described embodiments are only some of the embodiments of the application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the application, without creative effort, shall fall within the scope of protection of the application.

Referring to Fig. 1, the application provides a method for fast retrieval of similar character strings, the method comprising:

S1: reading a preset number of existing text entries and, for each text entry, splitting the text entry into several phrases and assigning a corresponding weight value to each phrase;

S2: performing a hash operation on the split phrases based on the assigned weight values, to obtain a first hash string corresponding to the text entry;

S3: performing collapse processing on the first hash string, to obtain a second hash string whose length satisfies a specified condition;

S4: building a prefix tree over the second hash strings and, based on the prefix tree, retrieving strings similar to a target string from the preset number of existing text entries.

In this embodiment, the relevance between the current phrase and the text entry can be determined by calculating the spatial distance between vectors. Specifically, the current phrase and the text entry can each be converted into a word vector; the spatial distance between the two vectors then determines the relevance between them: the smaller the distance, the higher the relevance.
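The mapping from vector distance to weight value can be sketched as follows. The application does not say which distance or which mapping is used, so the sketch assumes Euclidean distance and the mapping weight = 1 / (1 + distance); the vectors are assumed to come from some word-embedding model that is not shown, and the name `assign_weights` is hypothetical.

```python
import math


def assign_weights(phrase_vectors, entry_vector):
    """Give each phrase a weight that grows as its vector nears the entry vector.

    phrase_vectors: dict mapping a phrase to its vector (sequence of floats).
    entry_vector:   vector of the whole text entry, same dimension.
    """
    weights = {}
    for phrase, vec in phrase_vectors.items():
        distance = math.dist(vec, entry_vector)   # Euclidean distance
        weights[phrase] = 1.0 / (1.0 + distance)  # closer -> larger weight
    return weights
```

A phrase whose vector coincides with the entry vector receives the maximum weight of 1, and the weight decreases monotonically as the distance grows, which matches the rule that higher relevance yields a larger weight value.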

The third hash string is truncated, so that the second hash string obtained after truncation is shorter than the first hash string, while the correspondence between the second hash string and the first hash string is preserved.

Referring to Fig. 2, in this embodiment, retrieving strings similar to the target string includes:

Specifically, in one application scenario, similar strings can be retrieved according to the following steps:

The massive set of existing strings to be processed is segmented with a word segmentation algorithm, and features are extracted from the segmented text;

Different weights are assigned to different features, and a locality-sensitive hash operation is applied to them with the SimHash algorithm, to obtain a hash string H1;

H1 is cut into several character segments H2 at a fixed character interval, each character segment is given the same weight (usually set to 1), and the SimHash operation is then applied again to the character segments H2, to obtain a hash string H3;

H3 is truncated so that it is shorter than H1; this process is the collapse of the hash and is referred to as the collapse algorithm;

A prefix tree T1 is built over H3, while ensuring that the correspondence between H1 and H3 is preserved;

For the input string H4 to be retrieved, its two hash values H5 and H6 are calculated in the manner described above;

A fast similarity lookup on tree T1 is performed with H6, to obtain a set S1;

A prefix tree T2 is built over S1, and a fast similarity lookup on T2 is performed with H5, finally yielding a set S2;

The set S2 can be regarded as the set of strings similar to H4.
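The coarse-to-fine flow of the steps above (collapsed hash first, full hash second) can be sketched as follows. To keep the sketch short, the two prefix-tree lookups are replaced by plain linear scans using Hamming distance; this is a deliberate simplification, not the lookup the application describes. Whitespace tokenization, uniform weights of 1, MD5, 64-bit hashes and both thresholds are likewise assumptions, and all function names are hypothetical.

```python
import hashlib


def simhash(weighted_items, bits=64):
    """Weighted SimHash returning a '0'/'1' string of length `bits`."""
    totals = [0.0] * bits
    for item, weight in weighted_items:
        h = int.from_bytes(hashlib.md5(item.encode("utf-8")).digest(), "big")
        for i in range(bits):
            totals[i] += weight if (h >> i) & 1 else -weight
    return "".join("1" if t > 0 else "0" for t in totals)


def collapse(h1, interval=8, out_len=16):
    """Re-hash fixed-interval segments of h1 with weight 1, then truncate."""
    segments = [h1[i:i + interval] for i in range(0, len(h1), interval)]
    return simhash([(s, 1) for s in segments], bits=len(h1))[:out_len]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def full_hash(text):
    """H1 (or H5 for a query): SimHash over whitespace tokens, weight 1 each."""
    return simhash([(word, 1) for word in text.split()])


def two_stage_search(query, entries, coarse_threshold=4, fine_threshold=8):
    # Index: collapsed hash H3 -> list of (full hash H1, original entry),
    # so the correspondence between H3 and H1 is never lost.
    index = {}
    for entry in entries:
        h1 = full_hash(entry)
        index.setdefault(collapse(h1), []).append((h1, entry))

    h5 = full_hash(query)  # full hash of the query
    h6 = collapse(h5)      # collapsed hash of the query

    # Stage 1: coarse filter on the short hashes, giving the set S1.
    s1 = [pair for h3, pairs in index.items()
          if hamming(h3, h6) < coarse_threshold
          for pair in pairs]

    # Stage 2: fine filter on the full hashes of S1 only, giving the set S2.
    return [entry for h1, entry in s1 if hamming(h1, h5) < fine_threshold]
```

The first stage compares only short collapsed hashes, so most entries are discarded cheaply; the second stage applies the stricter comparison to the much smaller set S1. An entry identical to the query always survives both stages, because both of its distances are zero.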

Referring to Fig. 3, in one embodiment of the application, retrieving strings similar to the target string from the preset number of existing text entries based on the prefix tree includes:

Referring to Fig. 4, the application also provides a system for fast retrieval of similar character strings, the system comprising:

a text entry processing unit 100, configured to read a preset number of existing text entries and, for each text entry, split the text entry into several phrases and assign a corresponding weight value to each phrase;

a first hash string determination unit 200, configured to perform a hash operation on the split phrases based on the assigned weight values, to obtain a first hash string corresponding to the text entry;

a collapse processing unit 300, configured to perform collapse processing on the first hash string, to obtain a second hash string whose length satisfies a specified condition;

a retrieval unit 400, configured to build a prefix tree over the second hash strings and, based on the prefix tree, retrieve strings similar to a target string from the preset number of existing text entries.

In this embodiment, the collapse processing unit includes:

In the present embodiment, the retrieval unit includes:

The above description of the various embodiments of the application is provided to those skilled in the art for purposes of illustration. It is not intended to be exhaustive or to limit the invention to a single disclosed embodiment. As noted above, various alternatives and variations of the application will be apparent to those of ordinary skill in the art to which the above technology belongs. Therefore, although some alternative embodiments have been discussed specifically, other embodiments will be apparent to, or relatively easily derived by, those skilled in the art. The application is intended to cover all alternatives, modifications and variations of the invention discussed herein, as well as other embodiments that fall within the spirit and scope of the above application.

Claims

1. A method for fast retrieval of similar character strings, characterized in that the method comprises:

reading a preset number of existing text entries and, for each text entry, splitting the text entry into several phrases and assigning a corresponding weight value to each phrase;

performing a hash operation on the split phrases based on the assigned weight values, to obtain a first hash string corresponding to the text entry;

performing collapse processing on the first hash string, to obtain a second hash string whose length satisfies a specified condition;

building a prefix tree over the second hash strings and, based on the prefix tree, retrieving strings similar to a target string from the preset number of existing text entries.

2. The method according to claim 1, characterized in that assigning a corresponding weight value to each phrase includes:

assigning a weight value to the current phrase according to the relevance between the current phrase and the text entry, wherein the higher the relevance, the larger the corresponding weight value.

3. The method according to claim 1, characterized in that performing a hash operation on the split phrases includes:

processing the split phrases and their corresponding weight values with the SimHash algorithm, to obtain the first hash string corresponding to the text entry.

4. The method according to claim 1, characterized in that performing collapse processing on the first hash string includes:

splitting the first hash string into multiple substrings at fixed intervals, and assigning the same weight value to each substring obtained by the split;

processing the split substrings and their corresponding weight values with the SimHash algorithm, to obtain a third hash string corresponding to the first hash string;

if necessary, truncating the third hash string, so that the second hash string obtained after truncation is shorter than the first hash string, while the correspondence between the second hash string and the first hash string is preserved.

5. The method according to claim 4, characterized in that retrieving strings similar to the target string includes:

processing the split phrases and their corresponding weight values with the SimHash algorithm, to obtain a fourth hash string corresponding to the target string;

performing collapse processing on the fourth hash string, to obtain a fifth hash string that is shorter than the fourth hash string;

building a new prefix tree over the first result set, and searching the new prefix tree for the fourth hash string, to obtain a second result set;

6. The method according to claim 1, characterized in that retrieving strings similar to the target string from the preset number of existing text entries based on the prefix tree includes:

S51: searching downward layer by layer, starting from the root node of the prefix tree, and calculating the edit distance between the current node and the target string;

S53: when the edit distance reaches the specified threshold, stopping the search of the current node and of its child nodes, and continuing the layer-by-layer search from the next sibling node at the same level as the current node;

S54: if the current node has no child nodes, regarding the hash string corresponding to the node as similar to the hash string of the target string, stopping the retrieval of the current node, and continuing the layer-by-layer search from the next sibling node at the same level as the current node;

S55: when all nodes in the prefix tree have been traversed, or the search process has stopped, ending the retrieval of similar strings.

7. A system for fast retrieval of similar character strings, characterized in that the system comprises:

a text entry processing unit, configured to read a preset number of existing text entries and, for each text entry, split the text entry into several phrases and assign a corresponding weight value to each phrase;

a first hash string determination unit, configured to perform a hash operation on the split phrases based on the assigned weight values, to obtain a first hash string corresponding to the text entry;

a collapse processing unit, configured to perform collapse processing on the first hash string, to obtain a second hash string whose length satisfies a specified condition;

a retrieval unit, configured to build a prefix tree over the second hash strings and, based on the prefix tree, retrieve strings similar to a target string from the preset number of existing text entries.

8. The system according to claim 7, characterized in that the collapse processing unit includes:

a splitting module, configured to split the first hash string into multiple substrings at fixed intervals, and to assign the same weight value to each substring obtained by the split;

a SimHash module, configured to process the split substrings and their corresponding weight values with the SimHash algorithm, to obtain a third hash string corresponding to the first hash string;

a truncation module, configured to truncate the third hash string, so that the second hash string obtained after truncation is shorter than the first hash string, while the correspondence between the second hash string and the first hash string is preserved.

9. The system according to claim 8, characterized in that the retrieval unit includes:

a target string processing module, configured to split the target string into several phrases, and to assign a corresponding weight value to each phrase;

a fourth hash string determination module, configured to process the split phrases and their corresponding weight values with the SimHash algorithm, to obtain a fourth hash string corresponding to the target string;

a collapse processing module, configured to perform collapse processing on the fourth hash string, to obtain a fifth hash string that is shorter than the fourth hash string;

an intermediate retrieval module, configured to search the prefix tree for the fifth hash string, to obtain a first result set;

a re-retrieval module, configured to build a new prefix tree over the first result set, and to search the new prefix tree for the fourth hash string, to obtain a second result set;

a result determination module, configured to take the second result set as the set of strings similar to the fourth hash string.

10. The system according to claim 7, characterized in that the retrieval unit includes:

an edit distance calculation module, configured to search downward layer by layer, starting from the root node of the prefix tree, and to calculate the edit distance between the current node and the hash string of the target string;

a first determination module, configured to repeat the processing of the edit distance calculation module when the edit distance is less than a specified threshold, to complete the search of the child nodes;

a second determination module, configured to stop the search of the current node and of its child nodes when the edit distance reaches the specified threshold, and to continue the layer-by-layer search from the next sibling node at the same level as the current node;

a third determination module, configured to regard the hash string corresponding to the current node as similar to the hash string of the target string if the current node has no child nodes, to stop the retrieval of the current node, and to continue the layer-by-layer search from the next sibling node at the same level as the current node;

a retrieval ending module, configured to end the retrieval of similar strings when all nodes in the prefix tree have been traversed or the search process has stopped.