CN109241124B - Method and system for quickly retrieving similar character strings - Google Patents

Publication number
CN109241124B
CN109241124B CN201710558849.4A CN201710558849A CN109241124B CN 109241124 B CN109241124 B CN 109241124B CN 201710558849 A CN201710558849 A CN 201710558849A CN 109241124 B CN109241124 B CN 109241124B
Authority
CN
China
Prior art keywords
character string
hash
node
hash character
prefix tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710558849.4A
Other languages
Chinese (zh)
Other versions
CN109241124A (en
Inventor
李光曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xinhu Education Technology Co ltd
Original Assignee
Hujiang Education Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hujiang Education Technology Shanghai Co ltd
Priority to CN201710558849.4A
Publication of CN109241124A
Application granted
Publication of CN109241124B
Current legal status: Active
Anticipated expiration

Abstract

The application provides a method and a system for quickly retrieving similar character strings. The method comprises: reading a preset number of existing text entries, splitting each text entry into a plurality of phrases, and assigning a corresponding weight value to each phrase; performing a hash operation on the split phrases based on the assigned weight values to obtain a first hash character string corresponding to the text entry; collapsing the first hash character string to obtain a second hash character string whose length meets a specified condition; and building a prefix tree from the second hash character strings and, based on the prefix tree, retrieving character strings similar to a target character string from the preset number of existing text entries. The technical solution provided by the application can greatly improve the speed of character string retrieval.

Description

Method and system for quickly retrieving similar character strings
Technical field
The present application relates to the field of information processing technologies, and in particular, to a method and a system for quickly retrieving similar character strings.
Background art
In the field of information processing, it is often necessary to search a large number of text entries for character strings similar to a target character string. An existing algorithm calculates the edit distance between the target character string and each character string in the text entries, and treats every character string whose edit distance is smaller than a certain threshold as a similar character string.
This prior-art method has extremely high time complexity, and its performance cannot meet commercial requirements when there are hundreds of thousands of text entries. Besides the number of text entries to be compared, the time complexity of the existing algorithm also depends on the average string length of the text entries, so the algorithm cannot be applied to today's large-data scenarios.
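For illustration, the prior-art baseline described above can be sketched as a linear scan that computes a Levenshtein edit distance against every entry. The function names `edit_distance` and `brute_force_similar` are hypothetical and not taken from the patent. Each comparison costs time proportional to the product of the two string lengths, and the scan repeats it once per entry, which is why both the entry count and the average string length drive the cost.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def brute_force_similar(target: str, entries: List[str], threshold: int) -> List[str]:
    """Return every entry whose edit distance to the target is below the threshold."""
    return [e for e in entries if edit_distance(target, e) < threshold]
```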
Summary of the invention
The embodiment of the application aims to provide a method and a system for quickly searching similar character strings, which can greatly improve the speed of character string searching.
In order to achieve the above object, an aspect of the present application provides a method for quickly retrieving similar character strings, where the method includes:
reading a preset number of existing text entries, splitting each text entry into a plurality of phrases, and assigning a corresponding weight value to each phrase;
performing a hash operation on the split phrases based on the assigned weight values to obtain a first hash character string corresponding to the text entry;
collapsing the first hash character string to obtain a second hash character string whose length meets a specified condition;
and building a prefix tree from the second hash character strings, and retrieving, based on the prefix tree, character strings similar to the target character string from the preset number of existing text entries.
In this embodiment, assigning a corresponding weight value to each phrase includes:
distributing a corresponding weight value for the current phrase according to the relevance between the current phrase and the text entry; wherein, the higher the relevance is, the larger the corresponding weight value is.
In this embodiment, performing the hash operation on the split phrase includes:
and processing the split phrases and the corresponding weight values thereof by using a SimHash algorithm to obtain a first Hash character string corresponding to the text entry.
In this embodiment, collapsing the first hash string includes:
splitting the first hash character string into a plurality of sub character strings at fixed intervals, and distributing the same weight value to each sub character string obtained by splitting;
processing the split sub-character strings and the weight values corresponding to the sub-character strings by using a SimHash algorithm to obtain a third Hash character string corresponding to the first Hash character string;
if necessary, cutting the third hash character string, so that the resulting second hash character string is shorter than the first hash character string, while the correspondence between the second hash character string and the first hash character string is preserved.
In the present embodiment, the search for a character string similar to the target character string includes:
splitting the target character string into a plurality of phrases, and distributing a corresponding weight value for each phrase;
processing the split phrase and the corresponding weight value thereof by using a SimHash algorithm to obtain a fourth Hash character string corresponding to the target character string;
collapsing the fourth hash character string to obtain a fifth hash character string with the length smaller than that of the fourth hash character string;
retrieving the fifth hash string in the prefix tree to obtain a first result set;
establishing a new prefix tree for the first result set, and retrieving the fourth hash character string in the new prefix tree to obtain a second result set;
the second result set is used as a set of character strings similar to the fourth hash character string.
In this embodiment, retrieving a string similar to a target string from the existing preset number of text entries based on the prefix tree includes:
S51: searching downwards layer by layer from the top node of the prefix tree, and calculating the edit distance between the current node and the target character string;
S52: when the edit distance is smaller than a specified threshold, repeating step S51 to search the child nodes;
S53: when the edit distance reaches the specified threshold, stopping the search of the current node and its child nodes, and continuing the layer-by-layer search from the next sibling node at the same level as the current node;
S54: if the current node has no child node, treating the hash character string corresponding to the node as similar to the hash character string of the target character string, stopping the retrieval of the current node, and continuing the layer-by-layer search from the next sibling node at the same level as the current node;
S55: when all the nodes in the prefix tree have been traversed or the search has been stopped, ending the retrieval of similar character strings.
To achieve the above object, the present application further provides a system for quickly retrieving similar character strings, the system comprising:
the text entry processing unit is used for reading the existing text entries with the preset number, splitting the text entries into a plurality of word groups aiming at each text entry and distributing corresponding weight values for each word group;
the first hash character string determining unit is used for carrying out hash operation on the split phrases based on the distributed weight values so as to obtain a first hash character string corresponding to the text entry;
the collapse processing unit is used for collapsing the first Hash character string to obtain a second Hash character string with the length meeting the specified condition;
and the retrieval unit is used for establishing a prefix tree for the second Hash character string and retrieving character strings similar to the target character string from the existing text items with the preset number based on the prefix tree.
In this embodiment, the collapse processing unit includes:
the splitting module is used for splitting the first Hash character string into a plurality of sub character strings at fixed intervals and distributing the same weight value to each sub character string obtained by splitting;
the SimHash module is used for processing the split substrings and the weight values corresponding to the split substrings by using a SimHash algorithm to obtain third Hash strings corresponding to the first Hash strings;
and the cutting module is used for cutting the third Hash character string so that the length of the cut second Hash character string is smaller than that of the first Hash character string, and the corresponding relation between the second Hash character string and the first Hash character string is not split.
In this embodiment, the retrieval unit includes:
the target character string processing module is used for splitting the target character string into a plurality of phrases and distributing a corresponding weight value to each phrase;
the fourth hash character string determining module is used for processing the split word group and the weight value corresponding to the split word group by using a SimHash algorithm to obtain a fourth hash character string corresponding to the target character string;
the collapse processing module is used for collapsing the fourth hash character string to obtain a fifth hash character string with the length smaller than that of the fourth hash character string;
the intermediate retrieval module is used for retrieving the fifth hash character string in the prefix tree to obtain a first result set;
a second retrieval module, configured to establish a new prefix tree for the first result set, and retrieve the fourth hash string in the new prefix tree to obtain a second result set;
a result determination module to treat the second set of results as a set of strings that are similar to the fourth hash string.
In this embodiment, the retrieval unit includes:
the editing distance calculation module is used for searching downwards layer by layer from a top node of the prefix tree and calculating the editing distance between the current node and the hash character string of the target character string;
the first judgment module is used for repeating the processing process of the edit distance calculation module when the edit distance is smaller than a specified threshold value so as to complete the search of child nodes;
the second judgment module is used for stopping the searching process of the current node and the child node of the current node when the editing distance reaches the specified threshold value, and searching layer by layer from the next node of the brother node at the same level as the current node;
a third judging module, configured to, if there is no child node in the current node, consider that the hash character string corresponding to the node is similar to the hash character string of the target character string, terminate the retrieval process of the current node, and then search layer by layer starting from a next node of a sibling node that is at the same level as the current node;
and the retrieval ending module is used for ending the retrieval process of the similar character strings when all the nodes in the prefix tree are traversed or the searching process is stopped.
The invention converts a highly complex, computation-heavy text similarity matching process into searches over several prefix trees that are either associated with one another or dynamically generated, and by controlling the similarity threshold, approximately matching texts can be found within a given range. The time complexity of the algorithm is several orders of magnitude lower than that of calculating the edit distance between character strings one by one, so retrieval efficiency is greatly improved.
Specific embodiments of the present application are disclosed in detail with reference to the following description and drawings, indicating the manner in which the principles of the application may be employed. It should be understood that the embodiments of the present application are not so limited in scope. The embodiments of the application include many variations, modifications and equivalents within the spirit and scope of the appended claims.
Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments, in combination with or instead of the features of the other embodiments.
It should be emphasized that the term "comprises/comprising" when used herein, is taken to specify the presence of stated features, integers, steps or components but does not preclude the presence or addition of one or more other features, integers, steps or components.
Description of the drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the application, are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. It should be apparent that the drawings in the following description are merely some embodiments of the present application, and that other drawings may be obtained by those skilled in the art without inventive exercise. In the drawings:
FIG. 1 is a flow chart illustrating the building of a prefix tree according to an embodiment of the present disclosure;
FIG. 2 is a flowchart illustrating the retrieval of similar character strings according to an embodiment of the present disclosure;
FIG. 3 is a flow chart of similar string retrieval in another embodiment of the present application;
FIG. 4 is a functional block diagram of a system for quickly retrieving similar character strings according to an embodiment of the present application.
Detailed description of the invention
In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making creative efforts shall fall within the protection scope of the present application.
Referring to FIG. 1, the present application provides a method for quickly retrieving similar character strings, including:
S1: reading a preset number of existing text entries, splitting each text entry into a plurality of phrases, and assigning a corresponding weight value to each phrase;
S2: performing a hash operation on the split phrases based on the assigned weight values to obtain a first hash character string corresponding to the text entry;
S3: collapsing the first hash character string to obtain a second hash character string whose length meets a specified condition;
S4: building a prefix tree from the second hash character strings, and retrieving, based on the prefix tree, character strings similar to the target character string from the preset number of existing text entries.
In this embodiment, assigning a corresponding weight value to each phrase includes:
distributing a corresponding weight value for the current phrase according to the relevance between the current phrase and the text entry; wherein, the higher the relevance is, the larger the corresponding weight value is.
In this embodiment, the relevance of the current phrase to the text entry may be determined by calculating the spatial distance between vectors. Specifically, the current phrase and the text entry may both be converted into word vectors, and the relevance between the two may then be determined from the spatial distance between the two vectors: the smaller the distance, the higher the relevance.
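As a minimal sketch of this weighting step, the snippet below turns a vector distance into a weight. The patent does not specify the distance measure or the mapping from distance to weight, so the Euclidean distance and the `1 / (1 + d)` mapping are assumptions chosen only because they make the weight grow as the distance shrinks. The function names are hypothetical, and the vectors are expected to come from some word-embedding step that is not shown.

```python
import math
from typing import Sequence


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Spatial distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def weight_from_distance(phrase_vec: Sequence[float], entry_vec: Sequence[float]) -> float:
    """Assumed mapping: the weight lies in (0, 1] and decreases as distance grows."""
    return 1.0 / (1.0 + euclidean_distance(phrase_vec, entry_vec))
```

With this mapping, a phrase whose vector coincides with the entry vector gets weight 1, and more distant phrases get progressively smaller weights.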
In this embodiment, the performing the hash operation on the split phrase includes:
and processing the split word group and the corresponding weight value thereof by using a SimHash algorithm to obtain a first Hash character string corresponding to the text entry.
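A weighted SimHash over phrases can be sketched as follows. This is an illustrative version, not the patent's implementation: using MD5 as the per-phrase hash, a 64-bit output, and returning the result as a string of `0` and `1` characters are all assumptions, and the function names are hypothetical.

```python
import hashlib
from typing import Iterable, Tuple


def feature_hash(feature: str, bits: int = 64) -> int:
    """Deterministic per-phrase hash; MD5 is an assumed choice (bits must be <= 128)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(weighted_features: Iterable[Tuple[str, float]], bits: int = 64) -> str:
    """Weighted SimHash returned as a bit string."""
    totals = [0.0] * bits
    for feature, weight in weighted_features:
        h = feature_hash(feature, bits)
        for i in range(bits):
            # A set bit votes +weight for that position, a clear bit votes -weight.
            totals[i] += weight if (h >> i) & 1 else -weight
    # Each output bit is the sign of the accumulated vote.
    return "".join("1" if t > 0 else "0" for t in totals)
```

Because every bit is decided by a weighted vote, a phrase with a much larger weight than the others determines the result, which is how the weights from the previous step influence the first hash character string.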
In this embodiment, collapsing the first hash string includes:
splitting the first hash character string into a plurality of sub character strings at fixed intervals, and distributing the same weight value to each sub character string obtained by splitting;
processing the split sub-character strings and the weight values corresponding to the sub-character strings by using a SimHash algorithm to obtain a third Hash character string corresponding to the first Hash character string;
and cutting the third hash character string, so that the resulting second hash character string is shorter than the first hash character string, while the correspondence between the second hash character string and the first hash character string is preserved.
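The collapse step can be sketched as below, under stated assumptions: the segment length of 8, the output length of 16, MD5 as the inner hash, and the names `collapse` and `build_correspondence` are all hypothetical. The patent specifies only the fixed-interval split, the equal weights, the second SimHash pass, and the cut to a shorter length.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def simhash(weighted_features: Iterable[Tuple[str, float]], bits: int = 64) -> str:
    """Weighted SimHash returned as a bit string (MD5 is an assumed feature hash)."""
    totals = [0.0] * bits
    for feature, weight in weighted_features:
        h = int.from_bytes(hashlib.md5(feature.encode("utf-8")).digest(), "big")
        for i in range(bits):
            totals[i] += weight if (h >> i) & 1 else -weight
    return "".join("1" if t > 0 else "0" for t in totals)


def collapse(first_hash: str, segment_len: int = 8, out_len: int = 16) -> str:
    """Collapse a first hash string into a shorter second hash string."""
    # Split the first hash string into substrings at a fixed interval.
    segments = [first_hash[i:i + segment_len]
                for i in range(0, len(first_hash), segment_len)]
    # Give every substring the same weight (1) and run SimHash again.
    third_hash = simhash((seg, 1.0) for seg in segments)
    # Cut the third hash; out_len is assumed to be smaller than len(first_hash).
    return third_hash[:out_len]


def build_correspondence(first_hashes: List[str]) -> Dict[str, List[str]]:
    """Map each collapsed hash back to the first hashes that produced it."""
    index: Dict[str, List[str]] = {}
    for h1 in first_hashes:
        index.setdefault(collapse(h1), []).append(h1)
    return index
```

Keeping a dictionary from the collapsed string to its originals is one simple way to keep the correspondence intact, since several first hash strings may collapse to the same shorter string.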
Referring to fig. 2, in the present embodiment, the retrieving of the character string similar to the target character string includes:
splitting the target character string into a plurality of phrases, and distributing a corresponding weight value for each phrase;
processing the split phrase and the corresponding weight value thereof by using a SimHash algorithm to obtain a fourth Hash character string corresponding to the target character string;
collapsing the fourth Hash character string to obtain a fifth Hash character string of which the length is smaller than that of the fourth Hash character string;
retrieving the fifth hash string in the prefix tree to obtain a first result set;
establishing a new prefix tree for the first result set, and retrieving the fourth hash character string in the new prefix tree to obtain a second result set;
the second result set is used as a set of character strings similar to the fourth hash character string.
Specifically, in one application scenario, the similar character strings may be retrieved according to the following steps:
performing word segmentation on the existing large set of character strings to be processed using a word segmentation algorithm, and extracting features from the segmented text;
assigning different weights to different features, and performing a locality-sensitive hash operation on the features using the SimHash algorithm to obtain a hash character string H1;
cutting H1 into a plurality of character segments H2 at fixed character intervals, setting the same weight (generally 1) for each character segment, and then performing the SimHash operation again on the character segments H2 to obtain a hash character string H3;
cutting H3 so that its length is smaller than that of H1; this process is hash collapse, and the procedure is referred to as the collapse algorithm;
building a prefix tree T1 from H3 while ensuring that the correspondence between H1 and H3 is preserved;
calculating, by the same method, the two successive hash values H5 and H6 of the input character string H4 to be retrieved (H5 before collapse and H6 after collapse);
performing a fast similarity search of the tree T1 using H6 to obtain a set S1;
building a prefix tree T2 from S1, and performing a fast similarity search of T2 using H5 to finally obtain a set S2;
the set S2 can be regarded as the set of character strings similar to H4.
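The coarse-to-fine structure of the steps above can be sketched as follows. To keep the example short, the two prefix-tree searches are replaced by a plain linear scan with a Hamming distance, which is a stand-in and not the patent's search. The function names, the thresholds, and the toy hash values in the usage example are likewise hypothetical. Entries are given as precomputed (H1, H3) pairs.

```python
from typing import Dict, Iterable, List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length hash strings."""
    return sum(x != y for x, y in zip(a, b))


def scan(hashes: Iterable[str], query: str, threshold: int) -> List[str]:
    """Stand-in for a prefix-tree similarity search: keep hashes close to the query."""
    return [h for h in hashes if hamming(h, query) < threshold]


def two_stage_search(entries: List[Tuple[str, str]], h5: str, h6: str,
                     coarse_threshold: int, fine_threshold: int) -> List[str]:
    """entries holds (H1, H3) pairs; h5 and h6 are the query's two hashes."""
    # Preserve the H3 -> H1 correspondence so stage 1 can recover the originals.
    by_h3: Dict[str, List[str]] = {}
    for h1, h3 in entries:
        by_h3.setdefault(h3, []).append(h1)
    # Stage 1: search the collapsed hashes with H6 to obtain the set S1.
    s1 = [h1 for h3 in scan(by_h3.keys(), h6, coarse_threshold) for h1 in by_h3[h3]]
    # Stage 2: search only S1 with the uncollapsed H5 to obtain the set S2.
    return scan(s1, h5, fine_threshold)


# Toy usage with made-up hash strings.
entries = [("00000000", "0000"), ("00000011", "0000"), ("11111111", "1111")]
s2 = two_stage_search(entries, h5="00000001", h6="0000",
                      coarse_threshold=1, fine_threshold=2)
```

The point of the two stages is that the expensive comparison on the longer hash runs only over the candidates that survived the cheaper comparison on the collapsed hash.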
Referring to fig. 3, in an embodiment of the present application, retrieving a character string similar to a target character string from the existing preset number of text entries based on the prefix tree includes:
S51: searching downwards layer by layer from the top node of the prefix tree, and calculating the edit distance between the current node and the target character string;
S52: when the edit distance is smaller than a specified threshold, repeating step S51 to search the child nodes;
S53: when the edit distance reaches the specified threshold, stopping the search of the current node and its child nodes, and continuing the layer-by-layer search from the next sibling node at the same level as the current node;
S54: if the current node has no child node, treating the hash character string corresponding to the node as similar to the hash character string of the target character string, stopping the retrieval of the current node, and continuing the layer-by-layer search from the next sibling node at the same level as the current node;
S55: when all the nodes in the prefix tree have been traversed or the search has been stopped, ending the retrieval of similar character strings.
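Steps S51 to S55 can be sketched as a depth-first walk of a prefix tree that prunes a subtree as soon as the edit distance can no longer fall below the threshold. The patent does not say how the edit distance of a partial prefix is evaluated, so this sketch assumes the common row-by-row Levenshtein technique, in which the minimum of the current row is a lower bound for every string below the node. The class and function names are hypothetical. Unlike the literal wording of S54, the sketch also checks the final distance at an end node before accepting it.

```python
from typing import Dict, List


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_end = False  # True where a stored hash string ends


def build_trie(hashes: List[str]) -> TrieNode:
    root = TrieNode()
    for h in hashes:
        node = root
        for ch in h:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root


def search_similar(root: TrieNode, target: str, threshold: int) -> List[str]:
    results: List[str] = []

    def visit(node: TrieNode, prefix: str, prev_row: List[int]) -> None:
        for ch, child in node.children.items():
            # S51: extend the edit-distance table by one row for this node.
            row = [prev_row[0] + 1]
            for j in range(1, len(target) + 1):
                cost = 0 if target[j - 1] == ch else 1
                row.append(min(row[j - 1] + 1, prev_row[j] + 1, prev_row[j - 1] + cost))
            if min(row) >= threshold:
                continue  # S53: prune this node and its subtree, move to the sibling
            word = prefix + ch
            if child.is_end and row[-1] < threshold:
                results.append(word)  # S54: stored string accepted as similar
            visit(child, word, row)   # S52: still under the threshold, go deeper

    visit(root, "", list(range(len(target) + 1)))
    return results  # S55: traversal finished
```

Because stored strings share prefixes, each table row is computed once per tree node rather than once per stored string, and pruned subtrees are never visited at all.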
Referring to fig. 4, the present application further provides a system for quickly retrieving similar character strings, the system comprising:
the text entry processing unit 100 is configured to read existing text entries of a preset number, split each text entry into a plurality of phrases, and assign a corresponding weight value to each phrase;
a first hash character string determining unit 200, configured to perform a hash operation on the split phrase based on the assigned weight value, so as to obtain a first hash character string corresponding to the text entry;
a collapse processing unit 300, configured to perform collapse processing on the first hash character string to obtain a second hash character string whose length meets a specified condition;
a retrieving unit 400, configured to establish a prefix tree for the second hash character string, and retrieve a character string similar to the target character string from the existing preset number of text entries based on the prefix tree.
In this embodiment, the collapse processing unit includes:
the splitting module is used for splitting the first hash character string into a plurality of sub character strings at fixed intervals and distributing the same weight value to each sub character string obtained by splitting;
the SimHash module is used for processing the split substrings and the weight values corresponding to the split substrings by using a SimHash algorithm to obtain third Hash strings corresponding to the first Hash strings;
and the cutting module is used for cutting the third Hash character string so that the length of the cut second Hash character string is smaller than that of the first Hash character string, and the corresponding relation between the second Hash character string and the first Hash character string is not split.
In this embodiment, the retrieval unit includes:
the target character string processing module is used for splitting the target character string into a plurality of phrases and distributing a corresponding weight value to each phrase;
the fourth hash character string determining module is used for processing the split phrase and the corresponding weight value thereof by using a SimHash algorithm to obtain a fourth hash character string corresponding to the target character string;
the collapse processing module is used for collapsing the fourth hash character string to obtain a fifth hash character string with the length smaller than that of the fourth hash character string;
the intermediate retrieval module is used for retrieving the fifth hash character string in the prefix tree to obtain a first result set;
a second retrieval module, configured to establish a new prefix tree for the first result set, and retrieve the fourth hash string in the new prefix tree to obtain a second result set;
a result determination module to treat the second set of results as a set of strings similar to the fourth hash string.
In this embodiment, the retrieval unit includes:
the editing distance calculation module is used for searching downwards layer by layer from a top node of the prefix tree and calculating the editing distance between the current node and the hash character string of the target character string;
the first judgment module is used for repeating the processing process of the edit distance calculation module when the edit distance is smaller than a specified threshold value so as to complete the search of child nodes;
the second judgment module is used for stopping the searching process of the current node and the child node of the current node when the editing distance reaches the specified threshold value, and searching layer by layer from the next node of the brother node which is in the same level as the current node;
a third judging module, configured to, if there is no child node in the current node, consider that the hash character string corresponding to the node is similar to the hash character string of the target character string, terminate the retrieval process of the current node, and then search layer by layer starting from a next node of a sibling node that is at the same level as the current node;
and the retrieval ending module is used for ending the retrieval process of the similar character strings when all the nodes in the prefix tree are traversed or the searching process is stopped.
The invention converts a highly complex, computation-heavy text similarity matching process into searches over several prefix trees that are either associated with one another or dynamically generated, and by controlling the similarity threshold, approximately matching texts can be found within a given range. The time complexity of the algorithm is several orders of magnitude lower than that of calculating the edit distance between character strings one by one, so retrieval efficiency is greatly improved.
The foregoing description of various embodiments of the present application is provided to those skilled in the art for the purpose of illustration. It is not intended to be exhaustive or to limit the invention to a single disclosed embodiment. As described above, various alternatives and modifications of the present application will be apparent to those skilled in the art to which the above-described technology pertains. Thus, while some alternative embodiments have been discussed in detail, other embodiments will be apparent or relatively easy to derive by those of ordinary skill in the art. This application is intended to cover all alternatives, modifications, and variations of the invention discussed herein, as well as other embodiments, which fall within the spirit and scope of the above-mentioned application.

Claims (10)

1. A method for rapidly retrieving similar character strings, the method comprising:
reading a preset number of existing text entries, splitting each text entry into a plurality of phrases, and assigning a corresponding weight value to each phrase;
performing a hash operation on the split phrases based on the assigned weight values to obtain a first hash character string corresponding to the text entry;
collapsing the first hash character string to obtain a second hash character string whose length meets a specified condition;
and building a prefix tree from the second hash character strings, and retrieving, based on the prefix tree, character strings similar to the target character string from the preset number of existing text entries.
2. The method of claim 1, wherein assigning a corresponding weight value to each phrase comprises:
distributing a corresponding weight value for the current phrase according to the relevance between the current phrase and the text entry; wherein, the higher the relevance is, the larger the corresponding weight value is.
3. The method of claim 1, wherein performing a hash operation on the split phrase comprises:
and processing the split word group and the corresponding weight value thereof by using a SimHash algorithm to obtain a first Hash character string corresponding to the text entry.
4. The method of claim 1, wherein collapsing the first hash string comprises:
splitting the first hash character string into a plurality of sub character strings at fixed intervals, and distributing the same weight value to each sub character string obtained by splitting;
processing the split sub-character strings and the weight values corresponding to the sub-character strings by using a SimHash algorithm to obtain a third Hash character string corresponding to the first Hash character string;
and, if necessary, cutting the third hash character string, so that the resulting second hash character string is shorter than the first hash character string, while the correspondence between the second hash character string and the first hash character string is preserved.
5. The method of claim 4, wherein retrieving a string that is similar to the target string comprises:
splitting the target character string into a plurality of phrases, and assigning a corresponding weight value to each phrase;
processing the split phrase and the corresponding weight value thereof by using a SimHash algorithm to obtain a fourth Hash character string corresponding to the target character string;
collapsing the fourth hash character string to obtain a fifth hash character string with the length smaller than that of the fourth hash character string;
retrieving the fifth hash character string in the prefix tree to obtain a first result set;
establishing a new prefix tree for the first result set, and retrieving the fourth hash character string in the new prefix tree to obtain a second result set;
the second result set is used as a set of character strings similar to the fourth hash character string.
6. The method of claim 1, wherein retrieving a string similar to a target string from the existing predetermined number of text entries based on the prefix tree comprises:
S51: searching downwards layer by layer from the top node of the prefix tree, and calculating the edit distance between the current node and the hash character string of the target character string;
S52: when the edit distance is smaller than a specified threshold, repeating step S51 to search the child nodes of the current node;
S53: when the edit distance reaches the specified threshold, stopping the search of the current node and of its child nodes, and continuing the layer-by-layer search from the next sibling node at the same level as the current node;
S54: if the current node has no child node, treating the hash character string corresponding to that node as similar to the hash character string of the target character string, stopping the retrieval for the current node, and continuing the layer-by-layer search from the next sibling node at the same level as the current node;
S55: ending the retrieval of similar character strings when all nodes in the prefix tree have been traversed or the search has been stopped.
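Steps S51 to S55 amount to a depth-first walk over the prefix tree with early pruning. The Python sketch below shows one possible reading, with loudly stated assumptions: all stored hashes have the same length as the target hash, and the "edit distance" of a node is taken to be the number of positions at which the node's prefix differs from the equally long prefix of the target hash (substitutions only). The claim itself does not specify how the distance of a partial prefix is computed, and the names `TrieNode`, `build_prefix_tree` and `find_similar` are hypothetical.

```python
class TrieNode:
    """One node of the prefix tree; each edge is labelled with one hash character."""
    def __init__(self):
        self.children = {}

def build_prefix_tree(hashes):
    root = TrieNode()
    for h in hashes:
        node = root
        for ch in h:
            node = node.children.setdefault(ch, TrieNode())
    return root

def find_similar(root, target, threshold):
    """Collect stored hashes whose distance to `target` stays below `threshold`."""
    results = []

    def visit(node, prefix, distance):
        depth = len(prefix)
        for ch, child in node.children.items():  # siblings on the same level
            # S51: update the distance for the node reached through `ch`.
            mismatch = depth >= len(target) or ch != target[depth]
            d = distance + int(mismatch)
            if d >= threshold:
                continue  # S53: prune this node and its subtree, move to the next sibling
            if not child.children:
                results.append(prefix + ch)  # S54: a leaf below the threshold is a match
                continue
            visit(child, prefix + ch, d)  # S52: descend to the child nodes

    visit(root, "", 0)  # S55: the walk ends once every unpruned node has been visited
    return results

tree = build_prefix_tree(["1011", "1001", "0100"])
matches = find_similar(tree, "1011", threshold=2)
```

With this data, "0100" is abandoned after its second character because its distance has already reached the threshold, so its remaining characters are never compared.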
7. A system for rapidly retrieving similar character strings, the system comprising:
a text entry processing unit, configured to read a preset number of existing text entries, split each text entry into a plurality of phrases, and assign a corresponding weight value to each phrase;
a first hash character string determining unit, configured to perform a hash operation on the phrases based on the assigned weight values to obtain a first hash character string corresponding to the text entry;
a collapse processing unit, configured to collapse the first hash character string to obtain a second hash character string whose length meets a specified condition;
and a retrieval unit, configured to establish a prefix tree for the second hash character string and to retrieve, based on the prefix tree, character strings similar to a target character string from the preset number of existing text entries.
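As an illustration of what the text entry processing unit produces, the Python sketch below splits an entry into phrases and weights each one by how often it occurs. Both choices are assumptions made for the example: the claims do not say how phrases are delimited or how weights are derived, and text in a language without spaces, such as Chinese, would need a word segmenter instead of the regular expression used here. The function name `weighted_phrases` is hypothetical.

```python
import re
from collections import Counter

def weighted_phrases(entry):
    """Split a text entry into lowercase word tokens and weight each by its frequency."""
    tokens = re.findall(r"\w+", entry.lower())
    return dict(Counter(tokens))

weights = weighted_phrases("The cat and the hat")
```

The resulting phrase-to-weight mapping is the kind of input a weighted hash operation, such as SimHash, would consume to produce the first hash character string.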
8. The system of claim 7, wherein the collapse processing unit comprises:
a splitting module, configured to split the first hash character string into a plurality of substrings at fixed intervals and to assign the same weight value to each substring obtained by the splitting;
a SimHash module, configured to process the substrings and their corresponding weight values with a SimHash algorithm to obtain a third hash character string corresponding to the first hash character string;
and a truncation module, configured to truncate the third hash character string to obtain the second hash character string, such that the length of the second hash character string is smaller than that of the first hash character string and the correspondence between the second hash character string and the first hash character string is preserved.
9. The system of claim 8, wherein the retrieval unit comprises:
a target character string processing module, configured to split the target character string into a plurality of phrases and to assign a corresponding weight value to each phrase;
a fourth hash character string determining module, configured to process the phrases and their corresponding weight values with a SimHash algorithm to obtain a fourth hash character string corresponding to the target character string;
a collapse processing module, configured to collapse the fourth hash character string to obtain a fifth hash character string whose length is smaller than that of the fourth hash character string;
an intermediate retrieval module, configured to retrieve the fifth hash character string in the prefix tree to obtain a first result set;
a second retrieval module, configured to establish a new prefix tree for the first result set and to retrieve the fourth hash character string in the new prefix tree to obtain a second result set;
and a result determination module, configured to take the second result set as the set of character strings similar to the fourth hash character string.
10. The system of claim 7, wherein the retrieval unit comprises:
an edit distance calculation module, configured to search downwards layer by layer from the top node of the prefix tree and to calculate the edit distance between the current node and the hash character string of the target character string;
a first judgment module, configured to repeat the processing of the edit distance calculation module when the edit distance is smaller than a specified threshold, so as to search the child nodes of the current node;
a second judgment module, configured to stop the search of the current node and of its child nodes when the edit distance reaches the specified threshold, and to continue the layer-by-layer search from the next sibling node at the same level as the current node;
a third judgment module, configured to, if the current node has no child node, treat the hash character string corresponding to that node as similar to the hash character string of the target character string, stop the retrieval for the current node, and continue the layer-by-layer search from the next sibling node at the same level as the current node;
and a retrieval ending module, configured to end the retrieval of similar character strings when all nodes in the prefix tree have been traversed or the search has been stopped.
CN201710558849.4A 2017-07-11 2017-07-11 Method and system for quickly retrieving similar character strings Active CN109241124B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710558849.4A CN109241124B (en) 2017-07-11 2017-07-11 Method and system for quickly retrieving similar character strings

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710558849.4A CN109241124B (en) 2017-07-11 2017-07-11 Method and system for quickly retrieving similar character strings

Publications (2)

Publication Number Publication Date
CN109241124A CN109241124A (en) 2019-01-18
CN109241124B CN109241124B (en) 2023-03-10

Family

ID=65083305

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710558849.4A Active CN109241124B (en) 2017-07-11 2017-07-11 Method and system for quickly retrieving similar character strings

Country Status (1)

Country Link
CN (1) CN109241124B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI762764B (en) * 2019-02-15 2022-05-01 國風傳媒有限公司 Apparatus, method, and computer program product thereof for integrating terms

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6714927B1 (en) * 1999-08-17 2004-03-30 Ricoh Company, Ltd. Apparatus for retrieving documents
CN101499094A (en) * 2009-03-10 2009-08-05 焦点科技股份有限公司 Data compression storing and retrieving method and system
CN105913094A (en) * 2016-05-03 2016-08-31 中国科学院信息工程研究所 Minimum distance string calculation searching method
CN106033426A (en) * 2015-03-11 2016-10-19 中国科学院西安光学精密机械研究所 A latent semantic min-Hash-based image retrieval method
CN106407447A (en) * 2016-09-30 2017-02-15 福州大学 Simhash-based fuzzy sequencing searching method for encrypted cloud data
CN106909575A (en) * 2015-12-23 2017-06-30 北京国双科技有限公司 Text clustering method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LSH Ensemble: Internet-Scale Domain Search;Erkang Zhu 等;《https://arxiv.org/abs/1603.07410v4》;20160723;1-14 *
一种云存储名字空间架构的研究与设计;赵宁;《中国优秀硕士学位论文全文数据库》;20150715(第(2015)第07期);I137-54 *
加密云数据下基于Simhash的模糊排序搜索方案;杨旸 等;《计算机学报》;20160919;第40卷(第02期);431-444 *
基于语义指纹的海量文本快速相似检测算法研究;姜雪 等;《电脑知识与技术》;20161225;第12卷(第36期);175-177 *

Also Published As

Publication number Publication date
CN109241124A (en) 2019-01-18

Similar Documents

Publication Publication Date Title
CN111324784B (en) Character string processing method and device
CN106055574B (en) Method and device for identifying illegal uniform resource identifier (URL)
CN107102981B (en) Word vector generation method and device
US10467271B2 (en) Search apparatus and search method
US20180107933A1 (en) Web page training method and device, and search intention identifying method and device
US9195738B2 (en) Tokenization platform
CN108027814B (en) Stop word recognition method and device
US20160147867A1 (en) Information matching apparatus, information matching method, and computer readable storage medium having stored information matching program
CN106708798A (en) String segmentation method and device
CN113660541B (en) Method and device for generating abstract of news video
CN111325030A (en) Text label construction method and device, computer equipment and storage medium
US20180276244A1 (en) Method and system for searching for similar images that is nearly independent of the scale of the collection of images
CN109902290A (en) A kind of term extraction method, system and equipment based on text information
CN113094559A (en) Information matching method and device, electronic equipment and storage medium
CN113961768B (en) Sensitive word detection method and device, computer equipment and storage medium
CN109670153B (en) Method and device for determining similar posts, storage medium and terminal
CN109241124B (en) Method and system for quickly retrieving similar character strings
CN114090735A (en) Text matching method, device, equipment and storage medium
CN111753735B (en) Video clip detection method and device, electronic equipment and storage medium
CN106202127B (en) Method and device for processing retrieval request by vertical search engine
WO2016101737A1 (en) Search query method and apparatus
CN110555199B (en) Article generation method, device, equipment and storage medium based on hotspot materials
CN108304453B (en) Method and device for determining video related search terms
KR102609616B1 (en) Method and apparatus for image processing, electronic device and computer readable storage medium
JP2001101184A (en) Method and device for generating structurized document and storage medium with structurized document generation program stored therein

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room 703, No. 2, Boyun Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201203

Patentee after: Hujiang Education & Technology (Shanghai) Corp.,Ltd.

Address before: 201203 room 703, No. 2, Boyun Road, pilot Free Trade Zone, Pudong New Area, Shanghai

Patentee before: HUJIANG EDUCATION TECHNOLOGY (SHANGHAI) CO.,LTD.

TR01 Transfer of patent right

Effective date of registration: 20230825

Address after: Room C4207, Building 1168 West (C Building), No. 1687 Changyang Road, Yangpu District, Shanghai, 200082

Patentee after: Shanghai Xinhu Education Technology Co.,Ltd.

Address before: Room 703, No. 2, Boyun Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201203

Patentee before: Hujiang Education & Technology (Shanghai) Corp.,Ltd.
