CN109241124A - Method and system for quickly retrieving similar character strings - Google Patents

Method and system for quickly retrieving similar character strings


Publication number
CN109241124A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710558849.4A
Other languages
Chinese (zh)
Other versions
CN109241124B (en
Inventor
李光曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xinhu Education Technology Co ltd
Original Assignee
Shanghai Education Technology (shanghai) Ltd By Share Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Education Technology (shanghai) Ltd By Share Ltd filed Critical Shanghai Education Technology (shanghai) Ltd By Share Ltd
Priority to CN201710558849.4A priority Critical patent/CN109241124B/en
Publication of CN109241124A publication Critical patent/CN109241124A/en
Application granted granted Critical
Publication of CN109241124B publication Critical patent/CN109241124B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The application provides a method and system for quickly retrieving similar character strings. The method comprises: reading a preset number of existing text entries and, for each text entry, splitting the text entry into several phrases and assigning a corresponding weight value to each phrase; performing a hash operation on the split phrases based on the assigned weight values to obtain a first hash string corresponding to the text entry; performing collapse processing on the first hash string to obtain a second hash string whose length meets a specified condition; and building a prefix tree from the second hash strings and, based on the prefix tree, retrieving from the preset number of existing text entries the character strings similar to a target character string. The technical solution provided by the application can greatly increase the speed of string retrieval.

Description

Method and system for quickly retrieving similar character strings
Technical field
The application relates to the technical field of information processing, and in particular to a method and system for quickly retrieving similar character strings.
Background
In the current field of information processing, it is often necessary to query a massive set of text entries for character strings similar to a target character string. The existing algorithm computes the edit distance between the target character string and every character string in the massive set of text entries, and classifies all character strings whose edit distance is below some threshold as similar character strings.
This prior-art method has high time complexity, and with hundreds of thousands of text entries its performance often falls short of commercial requirements. Besides the number of text entries to be compared, the time complexity of the existing algorithm also depends on the average string length across all text entries, so it has not so far been usable in large-data scenarios.
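The prior-art baseline described above can be sketched as follows. This is a minimal illustration, assuming the standard Levenshtein edit distance; the function names `edit_distance` and `brute_force_similar` are hypothetical and do not appear in the application.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def brute_force_similar(target, entries, threshold):
    """Compare the target with every entry; keep those below the threshold."""
    return [e for e in entries if edit_distance(target, e) < threshold]
```

Each comparison costs time proportional to the product of the two string lengths, and one comparison is made per entry, which is why the cost grows with both the number of entries and their average length.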
Summary of the invention
The purpose of the embodiments of the application is to provide a method and system for quickly retrieving similar character strings that can greatly increase the speed of string retrieval.
To achieve the above purpose, in one aspect the application provides a method for quickly retrieving similar character strings, the method comprising:
Reading a preset number of existing text entries and, for each text entry, splitting the text entry into several phrases and assigning a corresponding weight value to each phrase;
Performing a hash operation on the split phrases based on the assigned weight values to obtain a first hash string corresponding to the text entry;
Performing collapse processing on the first hash string to obtain a second hash string whose length meets a specified condition;
Building a prefix tree from the second hash strings and, based on the prefix tree, retrieving from the preset number of existing text entries the character strings similar to a target character string.
In this embodiment, assigning a corresponding weight value to each phrase comprises:
Assigning a corresponding weight value to the current phrase according to the relevance between the current phrase and the text entry, wherein the higher the relevance, the larger the corresponding weight value.
In this embodiment, performing the hash operation on the split phrases comprises:
Processing the split phrases and their corresponding weight values with the SimHash algorithm to obtain the first hash string corresponding to the text entry.
In this embodiment, performing collapse processing on the first hash string comprises:
Splitting the first hash string into multiple substrings at fixed intervals, and assigning the same weight value to each substring obtained by the split;
Processing the split substrings and their corresponding weight values with the SimHash algorithm to obtain a third hash string corresponding to the first hash string;
If necessary, truncating the third hash string so that the length of the resulting second hash string is less than the length of the first hash string, while the correspondence between the second hash string and the first hash string is preserved.
In this embodiment, retrieving the character strings similar to the target character string comprises:
Splitting the target character string into several phrases and assigning a corresponding weight value to each phrase;
Processing the split phrases and their corresponding weight values with the SimHash algorithm to obtain a fourth hash string corresponding to the target character string;
Performing collapse processing on the fourth hash string to obtain a fifth hash string whose length is less than that of the fourth hash string;
Searching the prefix tree for the fifth hash string to obtain a first result set;
Building a new prefix tree from the first result set, and searching the new prefix tree for the fourth hash string to obtain a second result set;
Using the second result set as the set of character strings similar to the fourth hash string.
In this embodiment, retrieving from the preset number of existing text entries, based on the prefix tree, the character strings similar to the target character string comprises:
S51: Searching downward level by level from the top node of the prefix tree, and calculating the edit distance between the current node and the target character string;
S52: When the edit distance is less than a specified threshold, repeating step S51 to complete the search of the child nodes;
S53: When the edit distance reaches the specified threshold, stopping the search of the current node and of its child nodes, and continuing the search in turn from the next sibling node at the same level as the current node;
S54: If the current node has no child nodes, regarding the hash string corresponding to that node as similar to the hash string of the target character string, stopping the retrieval at the current node, and then continuing the search in turn from the next sibling node at the same level as the current node;
S55: When all nodes in the prefix tree have been traversed or the search process has stopped, ending the retrieval of similar character strings.
To achieve the above purpose, the application also provides a system for quickly retrieving similar character strings, the system comprising:
A text entry processing unit, configured to read a preset number of existing text entries and, for each text entry, split the text entry into several phrases and assign a corresponding weight value to each phrase;
A first hash string determination unit, configured to perform a hash operation on the split phrases based on the assigned weight values to obtain a first hash string corresponding to the text entry;
A collapse processing unit, configured to perform collapse processing on the first hash string to obtain a second hash string whose length meets a specified condition;
A retrieval unit, configured to build a prefix tree from the second hash strings and, based on the prefix tree, retrieve from the preset number of existing text entries the character strings similar to a target character string.
In this embodiment, the collapse processing unit comprises:
A splitting module, configured to split the first hash string into multiple substrings at fixed intervals and assign the same weight value to each substring obtained by the split;
A SimHash module, configured to process the split substrings and their corresponding weight values with the SimHash algorithm to obtain a third hash string corresponding to the first hash string;
A truncation module, configured to truncate the third hash string so that the length of the resulting second hash string is less than the length of the first hash string, while the correspondence between the second hash string and the first hash string is preserved.
In this embodiment, the retrieval unit comprises:
A target string processing module, configured to split the target character string into several phrases and assign a corresponding weight value to each phrase;
A fourth hash string determination module, configured to process the split phrases and their corresponding weight values with the SimHash algorithm to obtain a fourth hash string corresponding to the target character string;
A collapse processing module, configured to perform collapse processing on the fourth hash string to obtain a fifth hash string whose length is less than that of the fourth hash string;
An intermediate retrieval module, configured to search the prefix tree for the fifth hash string to obtain a first result set;
A re-retrieval module, configured to build a new prefix tree from the first result set and search the new prefix tree for the fourth hash string to obtain a second result set;
A result determination module, configured to use the second result set as the set of character strings similar to the fourth hash string.
In this embodiment, the retrieval unit comprises:
An edit distance calculation module, configured to search downward level by level from the top node of the prefix tree and calculate the edit distance between the current node and the hash string of the target character string;
A first determination module, configured to repeat the processing of the edit distance calculation module when the edit distance is less than a specified threshold, so as to complete the search of the child nodes;
A second determination module, configured to stop the search of the current node and of its child nodes when the edit distance reaches the specified threshold, and to continue the search in turn from the next sibling node at the same level as the current node;
A third determination module, configured to, if the current node has no child nodes, regard the hash string corresponding to that node as similar to the hash string of the target character string, stop the retrieval at the current node, and then continue the search in turn from the next sibling node at the same level as the current node;
A retrieval ending module, configured to end the retrieval of similar character strings when all nodes in the prefix tree have been traversed or the search process has stopped.
The invention converts the extremely complex and computationally heavy text-similarity matching process into lookups in several prefix trees that are interrelated or dynamically generated, and by controlling the similarity threshold it can match roughly identical similar texts within a certain range. Compared with computing the edit distance between character strings one by one, the time complexity of the algorithm is several orders of magnitude smaller, which greatly improves retrieval efficiency.
With reference to the following description and drawings, specific embodiments of the application are disclosed in detail, indicating the ways in which the principles of the application may be employed. It should be understood that the embodiments of the application are not thereby limited in scope. Within the spirit and terms of the appended claims, the embodiments of the application include many changes, modifications, and equivalents.
Features described and/or illustrated for one embodiment may be used in the same or a similar way in one or more other embodiments, combined with features of other embodiments, or substituted for features of other embodiments.
It should be emphasized that the term "comprises/comprising", when used herein, denotes the presence of a feature, integer, step, or component, but does not exclude the presence or addition of one or more other features, integers, steps, or components.
Brief description of the drawings
The accompanying drawings are included to provide a further understanding of the embodiments of the application and form part of the specification; they illustrate the embodiments of the application and, together with the written description, explain its principles. The drawings described below are clearly only some embodiments of the application, and a person of ordinary skill in the art can derive other drawings from them without creative effort. In the drawings:
Fig. 1 is a flowchart of building the prefix tree in an embodiment of the application;
Fig. 2 is a flowchart of retrieving similar character strings in an embodiment of the application;
Fig. 3 is a flowchart of retrieving similar character strings in another embodiment of the application;
Fig. 4 is a functional block diagram of the system for quickly retrieving similar character strings in an embodiment of the application.
Detailed description of embodiments
To help those skilled in the art better understand the technical solutions in the application, the technical solutions in the embodiments of the application are described clearly and completely below with reference to the drawings in those embodiments. The described embodiments are clearly only some of the embodiments of the application, not all of them. All other embodiments that a person of ordinary skill in the art obtains from the embodiments in the application without creative effort shall fall within the scope of protection of the application.
Referring to Fig. 1, the application provides a method for quickly retrieving similar character strings, the method comprising:
S1: Reading a preset number of existing text entries and, for each text entry, splitting the text entry into several phrases and assigning a corresponding weight value to each phrase;
S2: Performing a hash operation on the split phrases based on the assigned weight values to obtain a first hash string corresponding to the text entry;
S3: Performing collapse processing on the first hash string to obtain a second hash string whose length meets a specified condition;
S4: Building a prefix tree from the second hash strings and, based on the prefix tree, retrieving from the preset number of existing text entries the character strings similar to a target character string.
In this embodiment, assigning a corresponding weight value to each phrase comprises:
Assigning a corresponding weight value to the current phrase according to the relevance between the current phrase and the text entry, wherein the higher the relevance, the larger the corresponding weight value.
In this embodiment, the relevance between the current phrase and the text entry may be determined by calculating the spatial distance between vectors. Specifically, the current phrase and the text entry may each be converted into a word vector, and the spatial distance between the two vectors then indicates the relevance between them: the closer the distance, the higher the relevance.
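One way to turn the vector distance described above into a weight is sketched below. The application does not specify the distance metric or the mapping from distance to weight, so the Euclidean distance and the formula `1 / (1 + distance)` are assumptions made purely for illustration, as is the function name `relevance_weight`. The vectors are assumed to come from some existing word-embedding model.

```python
import math


def relevance_weight(phrase_vec, entry_vec):
    """Map the distance between two vectors to a weight in (0, 1].

    Assumption: Euclidean distance, with weight = 1 / (1 + distance),
    so a smaller distance (higher relevance) yields a larger weight.
    """
    distance = math.dist(phrase_vec, entry_vec)
    return 1.0 / (1.0 + distance)
```

Any monotonically decreasing function of the distance would satisfy the stated requirement that higher relevance gives a larger weight.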
In this embodiment, performing the hash operation on the split phrases comprises:
Processing the split phrases and their corresponding weight values with the SimHash algorithm to obtain the first hash string corresponding to the text entry.
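A weighted SimHash of the kind referred to above can be sketched as follows. The application does not state the per-phrase hash function, the fingerprint width, or the string encoding, so the use of MD5 truncated to 64 bits and a `'0'`/`'1'` output string are assumptions, and the function name `simhash` is hypothetical.

```python
import hashlib


def simhash(weighted_phrases, bits=64):
    """Weighted SimHash over (phrase, weight) pairs.

    Assumptions: each phrase is hashed with MD5 truncated to `bits` bits
    (bits must be a multiple of 8, at most 128), and the fingerprint is
    returned as a string of '0' and '1' characters.
    """
    totals = [0.0] * bits
    for phrase, weight in weighted_phrases:
        digest = hashlib.md5(phrase.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            bit = (value >> (bits - 1 - i)) & 1
            # A set bit votes +weight for position i, a clear bit votes -weight.
            totals[i] += weight if bit else -weight
    return "".join("1" if t > 0 else "0" for t in totals)
```

Because each output bit is a weighted vote, entries that share most of their heavily weighted phrases tend to produce fingerprints that differ in only a few positions.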
In this embodiment, performing collapse processing on the first hash string comprises:
Splitting the first hash string into multiple substrings at fixed intervals, and assigning the same weight value to each substring obtained by the split;
Processing the split substrings and their corresponding weight values with the SimHash algorithm to obtain a third hash string corresponding to the first hash string;
Truncating the third hash string so that the length of the resulting second hash string is less than the length of the first hash string, while the correspondence between the second hash string and the first hash string is preserved.
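The collapse steps above can be sketched as follows. The interval of 8 characters, the output length of 16, truncation by keeping the leading characters, and the MD5-based `simhash` helper are all assumptions for illustration; the names `collapse` and `build_collapse_index` are hypothetical. The dictionary returned by `build_collapse_index` is one simple way to keep the correspondence between each second hash string and the first hash strings it came from.

```python
import hashlib


def simhash(weighted_items, bits=64):
    """Weighted SimHash over (text, weight) pairs, as a '0'/'1' string."""
    totals = [0.0] * bits
    for text, weight in weighted_items:
        digest = hashlib.md5(text.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            bit = (value >> (bits - 1 - i)) & 1
            totals[i] += weight if bit else -weight
    return "".join("1" if t > 0 else "0" for t in totals)


def collapse(first_hash, interval=8, out_len=16):
    """Collapse a first hash string into a shorter second hash string."""
    if not 0 < out_len < len(first_hash):
        raise ValueError("out_len must be positive and shorter than first_hash")
    # Split at fixed intervals and give every segment the same weight (1).
    segments = [first_hash[i:i + interval]
                for i in range(0, len(first_hash), interval)]
    third_hash = simhash([(segment, 1) for segment in segments])
    # Assumed truncation rule: keep the leading out_len characters.
    return third_hash[:out_len]


def build_collapse_index(first_hashes):
    """Map each second hash string to the first hash strings behind it."""
    index = {}
    for first_hash in first_hashes:
        index.setdefault(collapse(first_hash), []).append(first_hash)
    return index
```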
Referring to Fig. 2, in this embodiment, retrieving the character strings similar to the target character string comprises:
Splitting the target character string into several phrases and assigning a corresponding weight value to each phrase;
Processing the split phrases and their corresponding weight values with the SimHash algorithm to obtain a fourth hash string corresponding to the target character string;
Performing collapse processing on the fourth hash string to obtain a fifth hash string whose length is less than that of the fourth hash string;
Searching the prefix tree for the fifth hash string to obtain a first result set;
Building a new prefix tree from the first result set, and searching the new prefix tree for the fourth hash string to obtain a second result set;
Using the second result set as the set of character strings similar to the fourth hash string.
Specifically, in one application scenario, similar character strings can be retrieved by the following steps:
Segmenting the massive set of existing character strings to be processed with a word segmentation algorithm, and extracting features from the segmented text;
Assigning different weights to different features and applying a locality-sensitive hash operation to them with the SimHash algorithm to obtain a hash string H1;
Cutting H1 into several character segments H2 at a fixed character interval, setting the same weight for each character segment (usually 1), and then applying the SimHash operation again to the character segments H2 to obtain a hash string H3;
Truncating H3 so that its length is less than that of H1; this process is the collapse of the hash and is called the collapse algorithm;
Building a prefix tree T1 from H3 while ensuring that the correspondence between H1 and H3 is preserved;
For an input character string H4 to be retrieved, calculating its two hash values H5 and H6 in the manner described above;
Using H6 to perform a fast similarity lookup on the tree T1, obtaining a set S1;
Building a prefix tree T2 from S1 and using H5 to perform a fast similarity lookup on T2, finally obtaining a set S2;
The set S2 can be regarded as the set of strings similar to H4.
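The two-stage flow of the scenario above (coarse lookup with H6 to get S1, then fine lookup with H5 to get S2) can be sketched as follows. To keep the sketch short, both prefix-tree lookups are replaced by plain linear filters using Hamming distance; this is a stand-in chosen for brevity and is not the lookup structure the application describes. The names `hamming` and `two_stage_search`, the thresholds, and the toy hash values are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def two_stage_search(h5, h6, index, coarse_threshold, fine_threshold):
    """Return the full hashes similar to the query.

    h5    -- full hash of the query (plays the role of H1 for the query)
    h6    -- collapsed hash of the query (plays the role of H3)
    index -- dict mapping each stored H3 to the list of H1 values behind it
    """
    # Stage 1 (stand-in for the lookup on T1): short hashes close to H6;
    # the preserved H1/H3 correspondence yields the candidate set S1.
    s1 = [full
          for short, fulls in index.items()
          if hamming(h6, short) < coarse_threshold
          for full in fulls]
    # Stage 2 (stand-in for the lookup on T2): refine S1 with the full hash H5.
    s2 = [full for full in s1 if hamming(h5, full) < fine_threshold]
    return s2


# Toy example with 4-character collapsed hashes and 8-character full hashes.
toy_index = {"0000": ["00000000", "00001111"], "1111": ["11111111"]}
matches = two_stage_search("00000001", "0001", toy_index, 2, 2)
```

The coarse stage discards most entries cheaply on the short hash, so the more expensive comparison on the full hash runs only over the small set S1.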
Referring to Fig. 3, in one embodiment of the application, retrieving from the preset number of existing text entries, based on the prefix tree, the character strings similar to the target character string comprises:
S51: Searching downward level by level from the top node of the prefix tree, and calculating the edit distance between the current node and the target character string;
S52: When the edit distance is less than a specified threshold, repeating step S51 to complete the search of the child nodes;
S53: When the edit distance reaches the specified threshold, stopping the search of the current node and of its child nodes, and continuing the search in turn from the next sibling node at the same level as the current node;
S54: If the current node has no child nodes, regarding the hash string corresponding to that node as similar to the hash string of the target character string, stopping the retrieval at the current node, and then continuing the search in turn from the next sibling node at the same level as the current node;
S55: When all nodes in the prefix tree have been traversed or the search process has stopped, ending the retrieval of similar character strings.
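Steps S51 to S55 can be sketched as a depth-first search over a prefix tree that prunes subtrees by edit distance. The application does not say how the edit distance "at the current node" is computed, so this sketch makes an assumption: each node carries one row of the Levenshtein dynamic-programming table for its prefix, the smallest value in that row serves as a lower bound for every string below the node, and a stored key is accepted only if its full distance is below the threshold. The names `TrieNode`, `build_trie`, and `search_similar` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.is_end = False  # True if a stored hash string ends here


def build_trie(keys):
    """Insert every hash string into a prefix tree and return the root."""
    root = TrieNode()
    for key in keys:
        node = root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root


def search_similar(root, target, threshold):
    """Return stored keys whose edit distance to target is below threshold."""
    results = []

    def visit(node, prefix, prev_row):
        # S54: a stored key ends here; accept it if it is close enough.
        if node.is_end and prev_row[-1] < threshold:
            results.append(prefix)
        for ch, child in node.children.items():
            # S51: extend the Levenshtein table by one row for this child.
            row = [prev_row[0] + 1]
            for j in range(1, len(target) + 1):
                cost = 0 if target[j - 1] == ch else 1
                row.append(min(row[j - 1] + 1,
                               prev_row[j] + 1,
                               prev_row[j - 1] + cost))
            # S52: still under the threshold, so descend into the child.
            # S53: otherwise skip the whole subtree and try the next sibling.
            if min(row) < threshold:
                visit(child, prefix + ch, row)

    visit(root, "", list(range(len(target) + 1)))
    return results  # S55: the traversal has finished
```

Because a pruned node discards every key beneath it at once, keys that share a dissimilar prefix are rejected without being compared individually.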
Referring to Fig. 4, the application also provides a system for quickly retrieving similar character strings, the system comprising:
A text entry processing unit 100, configured to read a preset number of existing text entries and, for each text entry, split the text entry into several phrases and assign a corresponding weight value to each phrase;
A first hash string determination unit 200, configured to perform a hash operation on the split phrases based on the assigned weight values to obtain a first hash string corresponding to the text entry;
A collapse processing unit 300, configured to perform collapse processing on the first hash string to obtain a second hash string whose length meets a specified condition;
A retrieval unit 400, configured to build a prefix tree from the second hash strings and, based on the prefix tree, retrieve from the preset number of existing text entries the character strings similar to a target character string.
In this embodiment, the collapse processing unit comprises:
A splitting module, configured to split the first hash string into multiple substrings at fixed intervals and assign the same weight value to each substring obtained by the split;
A SimHash module, configured to process the split substrings and their corresponding weight values with the SimHash algorithm to obtain a third hash string corresponding to the first hash string;
A truncation module, configured to truncate the third hash string so that the length of the resulting second hash string is less than the length of the first hash string, while the correspondence between the second hash string and the first hash string is preserved.
In this embodiment, the retrieval unit comprises:
A target string processing module, configured to split the target character string into several phrases and assign a corresponding weight value to each phrase;
A fourth hash string determination module, configured to process the split phrases and their corresponding weight values with the SimHash algorithm to obtain a fourth hash string corresponding to the target character string;
A collapse processing module, configured to perform collapse processing on the fourth hash string to obtain a fifth hash string whose length is less than that of the fourth hash string;
An intermediate retrieval module, configured to search the prefix tree for the fifth hash string to obtain a first result set;
A re-retrieval module, configured to build a new prefix tree from the first result set and search the new prefix tree for the fourth hash string to obtain a second result set;
A result determination module, configured to use the second result set as the set of character strings similar to the fourth hash string.
In this embodiment, the retrieval unit comprises:
An edit distance calculation module, configured to search downward level by level from the top node of the prefix tree and calculate the edit distance between the current node and the hash string of the target character string;
A first determination module, configured to repeat the processing of the edit distance calculation module when the edit distance is less than a specified threshold, so as to complete the search of the child nodes;
A second determination module, configured to stop the search of the current node and of its child nodes when the edit distance reaches the specified threshold, and to continue the search in turn from the next sibling node at the same level as the current node;
A third determination module, configured to, if the current node has no child nodes, regard the hash string corresponding to that node as similar to the hash string of the target character string, stop the retrieval at the current node, and then continue the search in turn from the next sibling node at the same level as the current node;
A retrieval ending module, configured to end the retrieval of similar character strings when all nodes in the prefix tree have been traversed or the search process has stopped.
The invention converts the extremely complex and computationally heavy text-similarity matching process into lookups in several prefix trees that are interrelated or dynamically generated, and by controlling the similarity threshold it can match roughly identical similar texts within a certain range. Compared with computing the edit distance between character strings one by one, the time complexity of the algorithm is several orders of magnitude smaller, which greatly improves retrieval efficiency.
The above descriptions of the various embodiments of the application are provided to those skilled in the art for purposes of description. They are not intended to be exhaustive or to limit the invention to a single disclosed embodiment. As noted above, various alternatives and variations of the application will be apparent to those skilled in the art to which the above technology pertains. Therefore, although some alternative embodiments have been discussed specifically, other embodiments will be apparent to, or relatively easily derived by, those skilled in the art. The application is intended to cover all alternatives, modifications, and variations of the invention discussed here, as well as other embodiments that fall within the spirit and scope of the above application.

Claims (10)

1. A method for quickly retrieving similar character strings, characterized in that the method comprises:
Reading a preset number of existing text entries and, for each text entry, splitting the text entry into several phrases and assigning a corresponding weight value to each phrase;
Performing a hash operation on the split phrases based on the assigned weight values to obtain a first hash string corresponding to the text entry;
Performing collapse processing on the first hash string to obtain a second hash string whose length meets a specified condition;
Building a prefix tree from the second hash strings and, based on the prefix tree, retrieving from the preset number of existing text entries the character strings similar to a target character string.
2. The method according to claim 1, characterized in that assigning a corresponding weight value to each phrase comprises:
Assigning a corresponding weight value to the current phrase according to the relevance between the current phrase and the text entry, wherein the higher the relevance, the larger the corresponding weight value.
3. The method according to claim 1, characterized in that performing the hash operation on the split phrases comprises:
Processing the split phrases and their corresponding weight values with the SimHash algorithm to obtain the first hash string corresponding to the text entry.
4. The method according to claim 1, characterized in that performing collapse processing on the first hash string comprises:
Splitting the first hash string into multiple substrings at fixed intervals, and assigning the same weight value to each substring obtained by the split;
Processing the split substrings and their corresponding weight values with the SimHash algorithm to obtain a third hash string corresponding to the first hash string;
If necessary, truncating the third hash string so that the length of the resulting second hash string is less than the length of the first hash string, while the correspondence between the second hash string and the first hash string is preserved.
5. The method according to claim 4, characterized in that retrieving the character strings similar to the target character string comprises:
Splitting the target character string into several phrases and assigning a corresponding weight value to each phrase;
Processing the split phrases and their corresponding weight values with the SimHash algorithm to obtain a fourth hash string corresponding to the target character string;
Performing collapse processing on the fourth hash string to obtain a fifth hash string whose length is less than that of the fourth hash string;
Searching the prefix tree for the fifth hash string to obtain a first result set;
Building a new prefix tree from the first result set, and searching the new prefix tree for the fourth hash string to obtain a second result set;
Using the second result set as the set of character strings similar to the fourth hash string.
6. The method according to claim 1, characterized in that retrieving, based on the prefix tree, character strings similar to the target character string from the preset number of existing text entries comprises:
S51: searching downward layer by layer starting from the top node of the prefix tree, and calculating the edit distance between the current node and the target character string;
S52: when the edit distance is less than a specified threshold, repeating step S51 to complete the search of the child nodes;
S53: when the edit distance reaches the specified threshold, stopping the search of the current node and its child nodes, and continuing the layer-by-layer search from the next sibling node at the same level as the current node;
S54: if the current node has no child nodes, deeming the hash character string corresponding to the node to be similar to the hash character string of the target character string, stopping the retrieval of the current node, and then continuing the layer-by-layer search from the next sibling node at the same level as the current node;
S55: when all nodes in the prefix tree have been traversed or the search process stops, ending the retrieval of similar character strings.
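Steps S51 to S55 can be sketched as a depth-first walk over a prefix tree that carries one row of the Levenshtein table per node. This is an illustrative reading, not the claimed implementation. Two assumptions are made: the "edit distance between the current node and the target" is taken to be the minimum of that node's table row (a lower bound for every string below the node), and a terminal node is reported only if its full edit distance is also below the threshold, which is slightly stricter than S54. The names `TrieNode`, `build_trie`, and `search_similar` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.is_end = False  # True if a stored string ends at this node


def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root


def search_similar(root, target, threshold):
    """Return (string, distance) pairs with edit distance below threshold."""
    results = []

    def visit(node, prefix, prev_row):
        for ch, child in node.children.items():
            # Extend the Levenshtein table by one row for this character.
            row = [prev_row[0] + 1]
            for i in range(1, len(target) + 1):
                cost = 0 if target[i - 1] == ch else 1
                row.append(min(row[i - 1] + 1,          # insertion
                               prev_row[i] + 1,         # deletion
                               prev_row[i - 1] + cost)) # match or substitution
            if min(row) >= threshold:
                continue  # S53: prune this subtree, move on to the next sibling
            if child.is_end and row[-1] < threshold:
                results.append((prefix + ch, row[-1]))  # S54: report a match
            visit(child, prefix + ch, row)              # S52: descend

    visit(root, "", list(range(len(target) + 1)))
    return results


# Made-up 8-bit hash strings for illustration.
root = build_trie(["10110010", "10110011", "01001101", "10100010"])
print(sorted(search_similar(root, "10110010", 2)))
# [('10100010', 1), ('10110010', 0), ('10110011', 1)]
```

Because the row minimum never decreases along a path, skipping a subtree once it reaches the threshold cannot discard a string that would have qualified.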
7. A system for quickly retrieving similar character strings, characterized in that the system comprises:
A text entry processing unit, configured to read a preset number of existing text entries and, for each text entry, split the text entry into several phrases and assign a corresponding weight value to each phrase;
A first hash character string determination unit, configured to perform a hash operation on the split phrases based on the assigned weight values, to obtain a first hash character string corresponding to the text entry;
A collapse processing unit, configured to perform collapse processing on the first hash character string, to obtain a second hash character string whose length meets a specified condition;
A retrieval unit, configured to build a prefix tree from the second hash character string and, based on the prefix tree, retrieve character strings similar to the target character string from the preset number of existing text entries.
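The weighted hash operation performed by the first hash character string determination unit can be illustrated with a generic SimHash sketch. The claims do not specify the per-phrase hash function or the fingerprint width, so the use of MD5, the 64-bit width, and the function name `simhash` are assumptions made here for illustration only.

```python
import hashlib


def simhash(weighted_phrases, bits=64):
    """Compute a SimHash fingerprint as a string of '0' and '1'.

    weighted_phrases: iterable of (phrase, weight) pairs.
    bits: fingerprint width; must be a multiple of 8 and at most 128,
          because the per-phrase hash is taken from an MD5 digest.
    """
    totals = [0.0] * bits
    for phrase, weight in weighted_phrases:
        digest = hashlib.md5(phrase.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            bit = (value >> (bits - 1 - i)) & 1
            # A set bit votes +weight for this position, a clear bit -weight.
            totals[i] += weight if bit else -weight
    # Each output bit is the sign of the accumulated vote.
    return "".join("1" if total > 0 else "0" for total in totals)


# Example with made-up phrases and weights.
fingerprint = simhash([("quick", 3), ("brown", 1), ("fox", 2)])
print(len(fingerprint))  # 64
```

Phrases with larger weights dominate the vote at each bit position, so entries that share their heavily weighted phrases tend to receive fingerprints that differ in few positions.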
8. The system according to claim 7, characterized in that the collapse processing unit comprises:
A splitting module, configured to split the first hash character string into multiple substrings at fixed intervals, and assign the same weight value to each resulting substring;
A SimHash module, configured to process the split substrings and their corresponding weight values using the SimHash algorithm, to obtain a third hash character string corresponding to the first hash character string;
A truncation module, configured to truncate the third hash character string so that the length of the resulting second hash character string is less than the length of the first hash character string, while the correspondence between the second hash character string and the first hash character string is not severed.
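The three modules of the collapse processing unit can be sketched as follows. This is one possible reading under stated assumptions: the interval of 8 characters, the output length of 16, the MD5-based SimHash, and the names `simhash_equal_weight` and `collapse` are all hypothetical, and the truncation simply keeps the leading characters of the third hash character string.

```python
import hashlib

BITS = 64  # assumed width of the intermediate (third) hash string


def simhash_equal_weight(tokens, weight=1):
    """SimHash over tokens that all carry the same weight."""
    totals = [0] * BITS
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: BITS // 8], "big")
        for i in range(BITS):
            bit = (value >> (BITS - 1 - i)) & 1
            totals[i] += weight if bit else -weight
    return "".join("1" if total > 0 else "0" for total in totals)


def collapse(first_hash, interval=8, out_len=16):
    """Map a long hash string to a shorter one, deterministically."""
    if not 0 < out_len <= BITS or out_len >= len(first_hash):
        raise ValueError("out_len must be positive, at most BITS, "
                         "and shorter than the input hash string")
    # Splitting module: fixed-interval substrings, all with equal weight.
    substrings = [first_hash[i:i + interval]
                  for i in range(0, len(first_hash), interval)]
    # SimHash module: hash the substrings into the third hash string.
    third_hash = simhash_equal_weight(substrings)
    # Truncation module: keep a prefix shorter than the first hash string.
    return third_hash[:out_len]


first_hash = "1011001011110000" * 4  # made-up 64-character hash string
second_hash = collapse(first_hash)
print(len(first_hash), len(second_hash))  # 64 16
```

Because every step is deterministic, the same first hash character string always yields the same second hash character string, which is one simple way to keep the correspondence between the two intact.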
9. The system according to claim 8, characterized in that the retrieval unit comprises:
A target character string processing module, configured to split the target character string into several phrases, and assign a corresponding weight value to each phrase;
A fourth hash character string determination module, configured to process the split phrases and their corresponding weight values using the SimHash algorithm, to obtain a fourth hash character string corresponding to the target character string;
A collapse processing module, configured to perform collapse processing on the fourth hash character string, to obtain a fifth hash character string whose length is less than that of the fourth hash character string;
An intermediate retrieval module, configured to retrieve the fifth hash character string in the prefix tree, to obtain a first result set;
A re-retrieval module, configured to build a new prefix tree from the first result set, and retrieve the fourth hash character string in the new prefix tree, to obtain a second result set;
A result determination module, configured to take the second result set as the set of character strings similar to the fourth hash character string.
10. The system according to claim 7, characterized in that the retrieval unit comprises:
An edit distance calculation module, configured to search downward layer by layer starting from the top node of the prefix tree, and calculate the edit distance between the current node and the hash character string of the target character string;
A first determination module, configured to repeat the processing of the edit distance calculation module when the edit distance is less than a specified threshold, to complete the search of the child nodes;
A second determination module, configured to, when the edit distance reaches the specified threshold, stop the search of the current node and its child nodes, and continue the layer-by-layer search from the next sibling node at the same level as the current node;
A third determination module, configured to, if the current node has no child nodes, deem the hash character string corresponding to the node to be similar to the hash character string of the target character string, stop the retrieval of the current node, and then continue the layer-by-layer search from the next sibling node at the same level as the current node;
A retrieval ending module, configured to end the retrieval of similar character strings when all nodes in the prefix tree have been traversed or the search process stops.
CN201710558849.4A 2017-07-11 2017-07-11 Method and system for quickly retrieving similar character strings Active CN109241124B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710558849.4A CN109241124B (en) 2017-07-11 2017-07-11 Method and system for quickly retrieving similar character strings

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710558849.4A CN109241124B (en) 2017-07-11 2017-07-11 Method and system for quickly retrieving similar character strings

Publications (2)

Publication Number Publication Date
CN109241124A (en) 2019-01-18
CN109241124B (en) 2023-03-10

Family

ID=65083305

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710558849.4A Active CN109241124B (en) 2017-07-11 2017-07-11 Method and system for quickly retrieving similar character strings

Country Status (1)

Country Link
CN (1) CN109241124B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI762764B (en) * 2019-02-15 2022-05-01 國風傳媒有限公司 Apparatus, method, and computer program product thereof for integrating terms

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6714927B1 (en) * 1999-08-17 2004-03-30 Ricoh Company, Ltd. Apparatus for retrieving documents
CN101499094A (en) * 2009-03-10 2009-08-05 焦点科技股份有限公司 Data compression storing and retrieving method and system
CN105913094A (en) * 2016-05-03 2016-08-31 中国科学院信息工程研究所 Minimum distance string calculation searching method
CN106033426A (en) * 2015-03-11 2016-10-19 中国科学院西安光学精密机械研究所 A latent semantic min-Hash-based image retrieval method
CN106407447A (en) * 2016-09-30 2017-02-15 福州大学 Simhash-based fuzzy sequencing searching method for encrypted cloud data
CN106909575A (en) * 2015-12-23 2017-06-30 北京国双科技有限公司 Text clustering method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
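Erkang Zhu et al.: "LSH Ensemble: Internet-Scale Domain Search", HTTPS://ARXIV.ORG/ABS/1603.07410V4 *
Jiang Xue et al.: "Research on a Fast Similarity Detection Algorithm for Massive Texts Based on Semantic Fingerprints", Computer Knowledge and Technology *
Yang Yang et al.: "A Simhash-Based Fuzzy Ranked Search Scheme over Encrypted Cloud Data", Chinese Journal of Computers *
Zhao Ning: "Research and Design of a Cloud Storage Namespace Architecture", China Master's Theses Full-text Database *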

Also Published As

Publication number Publication date
CN109241124B (en) 2023-03-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room 703, No. 2, Boyun Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201203

Patentee after: Hujiang Education & Technology (Shanghai) Corp.,Ltd.

Address before: 201203 room 703, No. 2, Boyun Road, pilot Free Trade Zone, Pudong New Area, Shanghai

Patentee before: HUJIANG EDUCATION TECHNOLOGY (SHANGHAI) CO.,LTD.

TR01 Transfer of patent right

Effective date of registration: 20230825

Address after: Room C4207, Building 1168 West (C Building), No. 1687 Changyang Road, Yangpu District, Shanghai, 200082

Patentee after: Shanghai Xinhu Education Technology Co.,Ltd.

Address before: Room 703, No. 2, Boyun Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201203

Patentee before: Hujiang Education & Technology (Shanghai) Corp.,Ltd.
