CN109241124B - Method and system for quickly retrieving similar character strings

Info

Publication number: CN109241124B
Application number: CN201710558849.4A
Authority: CN
Inventors: 李光曦
Original assignee: Hujiang Education Technology Shanghai Co ltd
Current assignee: Shanghai Xinhu Education Technology Co ltd
Priority date: 2017-07-11
Filing date: 2017-07-11
Publication date: 2023-03-10
Anticipated expiration: 2037-07-11
Also published as: CN109241124A

Abstract

The application provides a method and a system for quickly retrieving similar character strings, wherein the method comprises the following steps: reading the existing text entries with preset number, splitting the text entries into a plurality of phrases aiming at each text entry, and distributing corresponding weight values for each phrase; based on the distributed weight value, carrying out hash operation on the split phrase to obtain a first hash character string corresponding to the text entry; collapsing the first hash character string to obtain a second hash character string with the length meeting the specified condition; and establishing a prefix tree for the second hash character string, and retrieving character strings similar to the target character string from the existing text entries with the preset number based on the prefix tree. The technical scheme provided by the application can greatly improve the speed of character string retrieval.

Description

Method and system for quickly retrieving similar character strings

Technical field

The present application relates to the field of information processing technologies, and in particular, to a method and a system for quickly retrieving similar character strings.

Background art

In information processing, it is often necessary to search a large number of text entries for character strings similar to a target character string. The existing algorithm calculates the edit distance between the target character string and each character string in the text entries, and determines all character strings whose edit distance is smaller than a certain threshold to be similar character strings.

This prior-art method has extremely high time complexity, and its performance cannot meet commercial requirements once the collection reaches hundreds of thousands of text entries. Besides the number of text entries to be compared, the time complexity of the existing algorithm also depends on the average string length of the text entries, so the algorithm cannot be applied to today's scenarios with large data volumes.
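For illustration only, the prior-art approach described above can be sketched as follows. The function names `edit_distance` and `brute_force_similar` are hypothetical and do not appear in the application; the sketch assumes the Levenshtein definition of edit distance.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def brute_force_similar(target: str, entries: list[str], threshold: int) -> list[str]:
    """Return every entry whose edit distance to the target is below the threshold."""
    return [e for e in entries if edit_distance(target, e) < threshold]
```

Each comparison fills a table proportional to the product of the two string lengths, and one comparison is needed per entry, which is why the cost grows with both the number of entries and their average length.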

Summary of the invention

The embodiments of the application aim to provide a method and a system for quickly retrieving similar character strings, which can greatly improve the speed of character string retrieval.

In order to achieve the above object, an aspect of the present application provides a method for quickly retrieving similar character strings, where the method includes:

reading a preset number of existing text entries, splitting each text entry into a plurality of phrases, and assigning a corresponding weight value to each phrase;

based on the assigned weight values, performing a hash operation on the split phrases to obtain a first hash character string corresponding to the text entry;

collapsing the first hash character string to obtain a second hash character string whose length meets a specified condition;

and establishing a prefix tree for the second hash character string, and retrieving, based on the prefix tree, character strings similar to the target character string from the preset number of existing text entries.

In this embodiment, assigning a corresponding weight value to each phrase includes:

assigning a corresponding weight value to the current phrase according to the relevance between the current phrase and the text entry, wherein the higher the relevance, the larger the corresponding weight value.

In this embodiment, performing the hash operation on the split phrases includes:

processing the split phrases and their corresponding weight values with the SimHash algorithm to obtain a first hash character string corresponding to the text entry.
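A minimal sketch of a weighted SimHash over (phrase, weight) pairs is given below. The per-phrase hash function (MD5), the default output length of 64 bits, and the function name `simhash` are assumptions made for illustration; the application does not specify them.

```python
import hashlib


def simhash(features, bits=64):
    """Weighted SimHash.

    features: iterable of (phrase, weight) pairs.
    Returns a string of '0'/'1' characters of length `bits` (at most 128 here,
    because the assumed per-phrase hash, MD5, yields 128 bits).
    """
    if bits > 128:
        raise ValueError("this sketch supports at most 128 bits")
    totals = [0.0] * bits
    for phrase, weight in features:
        digest = hashlib.md5(phrase.encode("utf-8")).digest()
        value = int.from_bytes(digest, "big")
        for i in range(bits):
            # A set bit votes +weight for this position, a clear bit votes -weight.
            if (value >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    # Positions with a positive total become '1', all others '0'.
    return "".join("1" if t > 0 else "0" for t in totals)
```

Because each position is decided by a weighted vote, a phrase with a large weight dominates the result, and two entries that share most of their heavily weighted phrases tend to receive hash strings that differ in few positions.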

In this embodiment, collapsing the first hash string includes:

splitting the first hash character string into a plurality of substrings at fixed intervals, and assigning the same weight value to each substring obtained by the splitting;

processing the substrings and their corresponding weight values with the SimHash algorithm to obtain a third hash character string corresponding to the first hash character string;

if necessary, truncating the third hash character string so that the resulting second hash character string is shorter than the first hash character string, while the correspondence between the second hash character string and the first hash character string is preserved.
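The collapse steps above can be sketched as follows. This is an illustrative sketch only: the substring length `chunk`, the output length `out_len`, the MD5-based `simhash` helper, and the dictionary used to keep the correspondence between second and first hash strings are all assumptions, not details taken from the application.

```python
import hashlib


def simhash(features, bits=64):
    """Weighted SimHash over (token, weight) pairs; returns a '0'/'1' string."""
    if bits > 128:
        raise ValueError("this sketch supports at most 128 bits")
    totals = [0.0] * bits
    for token, weight in features:
        value = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest(), "big")
        for i in range(bits):
            totals[i] += weight if (value >> i) & 1 else -weight
    return "".join("1" if t > 0 else "0" for t in totals)


def collapse(first_hash, chunk=8, out_len=16):
    """Collapse a first hash string into a shorter second hash string."""
    # Split at fixed intervals; every substring gets the same weight (1.0).
    pieces = [first_hash[i:i + chunk] for i in range(0, len(first_hash), chunk)]
    # Second SimHash pass yields the third hash string.
    third = simhash([(p, 1.0) for p in pieces], bits=len(first_hash))
    # Truncate so the second hash string is shorter than the first.
    return third[:out_len]


def index_collapsed(first_hashes, chunk=8, out_len=16):
    """Map each second hash string to the first hash strings it came from."""
    index = {}
    for h1 in first_hashes:
        index.setdefault(collapse(h1, chunk, out_len), []).append(h1)
    return index
```

Keeping the mapping from each collapsed string back to its originals is one simple way to ensure the correspondence is not lost when several first hash strings collapse to the same second hash string.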

In the present embodiment, the search for a character string similar to the target character string includes:

splitting the target character string into a plurality of phrases, and assigning a corresponding weight value to each phrase;

processing the split phrases and their corresponding weight values with the SimHash algorithm to obtain a fourth hash character string corresponding to the target character string;

collapsing the fourth hash character string to obtain a fifth hash character string with the length smaller than that of the fourth hash character string;

retrieving the fifth hash string in the prefix tree to obtain a first result set;

establishing a new prefix tree for the first result set, and retrieving the fourth hash character string in the new prefix tree to obtain a second result set;

the second result set is used as a set of character strings similar to the fourth hash character string.

In this embodiment, retrieving a string similar to a target string from the existing preset number of text entries based on the prefix tree includes:

S51: searching downwards layer by layer from the top node of the prefix tree, and calculating the edit distance between the current node and the target character string;

S52: when the edit distance is smaller than a specified threshold, repeating step S51 to search the child nodes;

S53: when the edit distance reaches the specified threshold, stopping the search of the current node and its child nodes, and continuing the layer-by-layer search from the next sibling node at the same level as the current node;

S54: if the current node has no child node, considering the hash character string corresponding to the node to be similar to the hash character string of the target character string, stopping the retrieval process for the current node, and continuing the layer-by-layer search from the next sibling node at the same level as the current node;

S55: when all nodes in the prefix tree have been traversed or the search process has been stopped, ending the retrieval of similar character strings.
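Steps S51 to S55 can be sketched as a pruned depth-first traversal. The sketch below makes a simplifying assumption that is not stated in the application: all hash character strings have the same length, so the distance between a node's prefix and the target is taken as the number of mismatching positions so far (a Hamming distance, i.e. an edit distance restricted to substitutions). The nested-dictionary tree and the names `build_trie` and `search_trie` are likewise hypothetical.

```python
def build_trie(hashes):
    """Nested-dict prefix tree; each key is one character of a hash string."""
    root = {}
    for h in hashes:
        node = root
        for ch in h:
            node = node.setdefault(ch, {})
    return root


def search_trie(root, target, threshold):
    """Return stored hash strings whose mismatch count with target is below threshold."""
    results = []
    stack = [(root, "", 0)]  # (node, prefix so far, mismatches so far)
    while stack:             # S55: the search ends when nothing is left to visit
        node, prefix, dist = stack.pop()
        if not node and prefix:
            # S54: a node without children is accepted as similar.
            results.append(prefix)
            continue
        for ch, child in node.items():
            # S51: distance between the child's prefix and the target's prefix.
            d = dist + (ch != target[len(prefix)])
            if d < threshold:
                # S52: still below the threshold, keep descending.
                stack.append((child, prefix + ch, d))
            # S53: otherwise the whole subtree is skipped and siblings are tried.
    return sorted(results)
```

Under this assumption the mismatch count can only grow along a path, so once a prefix reaches the threshold none of its descendants can qualify, which is what makes skipping the whole subtree safe.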

To achieve the above object, the present application further provides a system for quickly retrieving similar character strings, the system comprising:

the text entry processing unit is used for reading a preset number of existing text entries, splitting each text entry into a plurality of phrases, and assigning a corresponding weight value to each phrase;

the first hash character string determining unit is used for performing a hash operation on the split phrases based on the assigned weight values, so as to obtain a first hash character string corresponding to the text entry;

the collapse processing unit is used for collapsing the first hash character string to obtain a second hash character string whose length meets a specified condition;

and the retrieval unit is used for establishing a prefix tree for the second hash character string and retrieving, based on the prefix tree, character strings similar to the target character string from the preset number of existing text entries.

In this embodiment, the collapse processing unit includes:

the splitting module is used for splitting the first Hash character string into a plurality of sub character strings at fixed intervals and distributing the same weight value to each sub character string obtained by splitting;

the SimHash module is used for processing the split substrings and the weight values corresponding to the split substrings by using a SimHash algorithm to obtain third Hash strings corresponding to the first Hash strings;

and the cutting module is used for cutting the third Hash character string so that the length of the cut second Hash character string is smaller than that of the first Hash character string, and the corresponding relation between the second Hash character string and the first Hash character string is not split.

In this embodiment, the retrieval unit includes:

the target character string processing module is used for splitting the target character string into a plurality of phrases and distributing a corresponding weight value to each phrase;

the fourth hash character string determining module is used for processing the split word group and the weight value corresponding to the split word group by using a SimHash algorithm to obtain a fourth hash character string corresponding to the target character string;

the collapse processing module is used for collapsing the fourth hash character string to obtain a fifth hash character string with the length smaller than that of the fourth hash character string;

the intermediate retrieval module is used for retrieving the fifth hash character string in the prefix tree to obtain a first result set;

a second retrieval module, configured to establish a new prefix tree for the first result set, and retrieve the fourth hash string in the new prefix tree to obtain a second result set;

a result determination module to treat the second set of results as a set of strings that are similar to the fourth hash string.

In this embodiment, the retrieval unit includes:

the edit distance calculation module is used for searching downwards layer by layer from the top node of the prefix tree and calculating the edit distance between the current node and the hash character string of the target character string;

the first judgment module is used for repeating the processing of the edit distance calculation module when the edit distance is smaller than a specified threshold, so as to search the child nodes;

the second judgment module is used for stopping the search of the current node and its child nodes when the edit distance reaches the specified threshold, and continuing the layer-by-layer search from the next sibling node at the same level as the current node;

the third judgment module is used for, if the current node has no child node, considering the hash character string corresponding to the node to be similar to the hash character string of the target character string, stopping the retrieval process for the current node, and continuing the layer-by-layer search from the next sibling node at the same level as the current node;

and the retrieval ending module is used for ending the retrieval of similar character strings when all nodes in the prefix tree have been traversed or the search process has been stopped.

The invention converts a highly complex, computation-heavy text similarity matching process into searches over several prefix trees that are either associated with one another or generated dynamically, and by controlling the similarity threshold it can match approximately identical texts within a certain range. The time complexity of the algorithm is several orders of magnitude lower than that of calculating the edit distance between character strings one by one, so retrieval efficiency is greatly improved.

Specific embodiments of the present application are disclosed in detail with reference to the following description and drawings, indicating the manner in which the principles of the application may be employed. It should be understood that the embodiments of the present application are not so limited in scope. The embodiments of the application include many variations, modifications and equivalents within the spirit and scope of the appended claims.

Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments, in combination with or instead of the features of the other embodiments.

It should be emphasized that the term "comprises/comprising" when used herein, is taken to specify the presence of stated features, integers, steps or components but does not preclude the presence or addition of one or more other features, integers, steps or components.

Description of the drawings

The accompanying drawings, which are included to provide a further understanding of the embodiments of the application, are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. It should be apparent that the drawings in the following description are merely some embodiments of the present application, and that other drawings may be obtained by those skilled in the art without inventive exercise. In the drawings:

FIG. 1 is a flow chart illustrating the building of a prefix tree according to an embodiment of the present disclosure;

FIG. 2 is a flowchart illustrating the retrieval of similar character strings according to an embodiment of the present disclosure;

FIG. 3 is a flow chart of similar string retrieval in another embodiment of the present application;

FIG. 4 is a functional block diagram of a system for quickly retrieving similar character strings according to an embodiment of the present application.

Detailed description of the invention

In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making creative efforts shall fall within the protection scope of the present application.

Referring to fig. 1, the present application provides a method for quickly retrieving similar character strings, including:

S1: reading a preset number of existing text entries, splitting each text entry into a plurality of phrases, and assigning a corresponding weight value to each phrase;

S2: based on the assigned weight values, performing a hash operation on the split phrases to obtain a first hash character string corresponding to the text entry;

S3: collapsing the first hash character string to obtain a second hash character string whose length meets a specified condition;

S4: establishing a prefix tree for the second hash character string, and retrieving, based on the prefix tree, character strings similar to the target character string from the preset number of existing text entries.

In this embodiment, the relevance of the current phrase to the text entry may be determined by calculating the spatial distance between vectors. Specifically, the current phrase and the text entry may both be converted into word vectors, so that the relevance between the two can be determined from the spatial distance between the two word vectors: the closer the distance, the higher the relevance.
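One possible way to turn such a distance into a weight value is sketched below. The application does not specify the distance measure or the mapping from distance to weight; the Euclidean distance and the mapping 1 / (1 + distance) are assumptions chosen only because they make the weight grow as the distance shrinks. How the word vectors themselves are obtained is outside the scope of the sketch.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def relevance_weight(phrase_vec, entry_vec):
    """Assumed mapping: a smaller distance (higher relevance) gives a larger weight."""
    return 1.0 / (1.0 + euclidean(phrase_vec, entry_vec))
```

With this mapping a phrase whose vector coincides with the entry's vector receives the maximum weight of 1.0, and the weight decreases monotonically as the vectors move apart.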

In this embodiment, the performing the hash operation on the split phrase includes:

processing the split phrases and their corresponding weight values with the SimHash algorithm to obtain a first hash character string corresponding to the text entry.

In this embodiment, collapsing the first hash string includes:

truncating the third hash character string, which is obtained by applying the SimHash algorithm to the equally weighted fixed-interval substrings of the first hash character string, so that the resulting second hash character string is shorter than the first hash character string, while the correspondence between the second hash character string and the first hash character string is preserved.

Referring to fig. 2, in the present embodiment, the retrieving of the character string similar to the target character string includes:

collapsing the fourth Hash character string to obtain a fifth Hash character string of which the length is smaller than that of the fourth Hash character string;

Specifically, in one application scenario, the similar character strings may be retrieved according to the following steps:

performing word segmentation on the existing mass of character strings to be processed according to a word segmentation algorithm, and extracting features from the segmented text;

assigning different weights to different features, and performing a locality-sensitive hash operation on the features with the SimHash algorithm to obtain a hash character string H1;

cutting H1 into a plurality of character segments H2 at fixed character intervals, setting the same weight (generally 1) for each character segment, and then performing the SimHash operation again on the character segments H2 to obtain a hash character string H3;

truncating H3 so that its length is smaller than that of H1; this process is a hash collapse and is referred to as the collapse algorithm;

establishing a prefix tree T1 for H3, while ensuring that the correspondence between H1 and H3 is preserved;

computing, by the same method, the two successive hash values H5 and H6 of the input character string H4 to be retrieved;

performing a quick similarity search of the tree T1 with H6 to obtain a set S1;

establishing a prefix tree T2 for S1, and performing a quick similarity search of T2 with H5 to finally obtain a set S2;

the set S2 can be considered to be the set of character strings similar to H4.
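The application scenario above can be sketched end to end as follows. Everything not named in the text is an assumption made for illustration: whitespace tokenization with unit weights stands in for word segmentation and feature weighting, MD5 is the per-feature hash, hash strings are 32 characters collapsed to 8, and the tree search uses a mismatch count over equal-length strings in place of a general edit distance. The variable names follow the labels H1, H3, H5, H6, T1, T2, S1 and S2 used in the text.

```python
import hashlib
from collections import defaultdict


def simhash(features, bits=32):
    """Weighted SimHash over (token, weight) pairs; returns a '0'/'1' string."""
    totals = [0.0] * bits
    for token, weight in features:
        value = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest(), "big")
        for i in range(bits):
            totals[i] += weight if (value >> i) & 1 else -weight
    return "".join("1" if t > 0 else "0" for t in totals)


def features_of(text):
    """Hypothetical stand-in for segmentation and weighting: unit-weight tokens."""
    return [(tok, 1.0) for tok in text.lower().split()]


def collapse(h1, chunk=4, out_len=8):
    """Second SimHash pass over fixed-interval segments, then truncation."""
    pieces = [h1[i:i + chunk] for i in range(0, len(h1), chunk)]
    return simhash([(p, 1.0) for p in pieces], bits=len(h1))[:out_len]


def build_trie(keys):
    """Nested-dict prefix tree over equal-length strings."""
    root = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
    return root


def search_trie(root, target, threshold):
    """Return stored strings with fewer than `threshold` mismatches against target."""
    results, stack = [], [(root, "", 0)]
    while stack:
        node, prefix, dist = stack.pop()
        if not node and prefix:
            results.append(prefix)          # leaf reached: accept
            continue
        for ch, child in node.items():
            d = dist + (ch != target[len(prefix)])
            if d < threshold:               # otherwise prune this subtree
                stack.append((child, prefix + ch, d))
    return sorted(results)


def retrieve(target_text, entries, t1_threshold, t2_threshold):
    """Two-stage search: coarse pass on collapsed hashes, fine pass on full hashes."""
    texts_by_h1 = defaultdict(list)   # H1 -> original texts
    h1_by_h3 = defaultdict(set)       # H3 -> H1 values (correspondence kept)
    for text in entries:
        h1 = simhash(features_of(text))
        texts_by_h1[h1].append(text)
        h1_by_h3[collapse(h1)].add(h1)
    t1 = build_trie(h1_by_h3)
    h5 = simhash(features_of(target_text))
    h6 = collapse(h5)
    s1 = {h1 for h3 in search_trie(t1, h6, t1_threshold) for h1 in h1_by_h3[h3]}
    t2 = build_trie(s1)               # T2 is built only over the coarse candidates
    s2 = search_trie(t2, h5, t2_threshold)
    return [text for h1 in s2 for text in texts_by_h1[h1]]
```

In this sketch the first pass only has to walk a tree of short strings, and the second, more precise pass runs over the much smaller candidate set S1 rather than over all entries.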

Referring to fig. 3, in an embodiment of the present application, retrieving a character string similar to a target character string from the existing preset number of text entries based on the prefix tree includes:

S52: when the edit distance is smaller than a specified threshold, repeating step S51 to search the child nodes;

S54: if the current node has no child node, considering the hash character string corresponding to the node to be similar to the hash character string of the target character string, stopping the retrieval process for the current node, and continuing the layer-by-layer search from the next sibling node at the same level as the current node;

Referring to fig. 4, the present application further provides a system for quickly retrieving similar character strings, the system comprising:

the text entry processing unit 100 is configured to read existing text entries of a preset number, split each text entry into a plurality of phrases, and assign a corresponding weight value to each phrase;

a first hash character string determining unit 200, configured to perform a hash operation on the split phrase based on the assigned weight value, so as to obtain a first hash character string corresponding to the text entry;

a collapse processing unit 300, configured to perform collapse processing on the first hash character string to obtain a second hash character string whose length meets a specified condition;

a retrieving unit 400, configured to establish a prefix tree for the second hash character string, and retrieve a character string similar to the target character string from the existing preset number of text entries based on the prefix tree.

In this embodiment, the collapse processing unit includes:

In this embodiment, the retrieval unit includes:

the fourth hash character string determining module is used for processing the split phrase and the corresponding weight value thereof by using a SimHash algorithm to obtain a fourth hash character string corresponding to the target character string;

a result determination module to treat the second set of results as a set of strings similar to the fourth hash string.

In this embodiment, the retrieval unit includes:

the second judgment module is used for stopping the search of the current node and its child nodes when the edit distance reaches the specified threshold, and continuing the layer-by-layer search from the next sibling node at the same level as the current node;

The invention converts a highly complex, computation-heavy text similarity matching process into searches over several prefix trees that are either associated with one another or generated dynamically, and by controlling the similarity threshold it can match approximately identical texts within a certain range. The time complexity of the algorithm is several orders of magnitude lower than that of calculating the edit distance between character strings one by one, so retrieval efficiency is greatly improved.

The foregoing description of various embodiments of the present application is provided to those skilled in the art for the purpose of illustration. It is not intended to be exhaustive or to limit the invention to a single disclosed embodiment. As described above, various alternatives and modifications of the present application will be apparent to those skilled in the art to which the above-described technology pertains. Thus, while some alternative embodiments have been discussed in detail, other embodiments will be apparent or relatively easy to derive by those of ordinary skill in the art. This application is intended to cover all alternatives, modifications, and variations of the invention discussed herein, as well as other embodiments, which fall within the spirit and scope of the above-mentioned application.

Claims

1. A method for rapidly retrieving similar character strings, the method comprising:

collapsing the first Hash character string to obtain a second Hash character string with the length meeting specified conditions;

and establishing a prefix tree for the second hash character string, and retrieving character strings similar to the target character string from the existing text items with the preset number based on the prefix tree.

2. The method of claim 1, wherein assigning a corresponding weight value to each phrase comprises:

3. The method of claim 1, wherein performing a hash operation on the split phrase comprises:

4. The method of claim 1, wherein collapsing the first hash string comprises:

and, if necessary, truncating the third hash character string so that the resulting second hash character string is shorter than the first hash character string, while the correspondence between the second hash character string and the first hash character string is preserved.

5. The method of claim 4, wherein retrieving a string that is similar to the target string comprises:

splitting the target character string into a plurality of phrases, and assigning a corresponding weight value to each phrase;

retrieving the fifth hash character string in the prefix tree to obtain a first result set;

6. The method of claim 1, wherein retrieving a string similar to a target string from the existing predetermined number of text entries based on the prefix tree comprises:

S53: when the edit distance reaches the specified threshold, stopping the search of the current node and its child nodes, and continuing the layer-by-layer search from the next sibling node at the same level as the current node;

7. A system for rapidly retrieving similar character strings, the system comprising:

the text entry processing unit is used for reading the existing text entries with the preset number, splitting the text entries into a plurality of phrases aiming at each text entry and distributing corresponding weight values for each phrase;

8. The system of claim 7, wherein the collapse processing unit comprises:

9. The system of claim 8, wherein the retrieval unit comprises:

10. The system of claim 7, wherein the retrieval unit comprises:

a third judging module, configured to, if there is no child node in the current node, consider that the hash character string corresponding to the node is similar to the hash character string of the target character string, terminate the retrieval process of the current node, and then start searching layer by layer from a next node of a sibling node that is at the same level as the current node;