CN102750379A - Fast character string matching method based on filtering type - Google Patents

Publication number
CN102750379A
Authority
CN
China
Prior art keywords
string
text
pattern
length
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012102118297A
Other languages
Chinese (zh)
Other versions
CN102750379B (en)
Inventor
李拥军
邹少聪
林浩
黄格仕
谢豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN201210211829.7A
Publication of CN102750379A
Application granted
Publication of CN102750379B
Legal status: Expired - Fee Related

Abstract

The invention discloses a filter-based method for fast string matching. The pattern string is first preprocessed: its prefix P0 is cut into k+s pattern blocks of length h, each block is extended by k+q-1 characters, and the extended blocks are denoted Q1, Q2, ..., Q(k+s). Then, starting from the beginning of the text string, q characters are read at intervals of h characters to serve as text string indexes, denoted d1, d2, ..., d(n/h). A match array B[d, j] is created: if a text string index di belongs to Qj, then B[di, j] = 1. The number of matches between k+s consecutive text string indexes and the pattern blocks is counted, and finally the candidate regions are checked for approximate matches. The method combines a dynamic programming algorithm with a filter algorithm and adds a new filtering strategy, which shortens the average time of approximate matching and improves matching performance.

Description

A filter-based method for fast string matching
Technical field
The present invention relates to the technical field of string matching, and in particular to a filter-based method for fast string matching; it belongs to the fields of information retrieval and computational biology.
Background art
The string matching problem can be defined as finding, in a given symbol sequence, a pattern with a certain property; the simplest example is finding a given string within another given string. "Approximate matching" generally means that some small differences are allowed between the pattern and the text string. In what follows, "matching" generally refers to approximate matching. String matching is one of the oldest and most widely studied problems in computer science, and its applications are everywhere. In recent years, research interest in string matching has grown steadily, particularly in the rapidly developing fields of information retrieval and computational biology. At the same time, approximate matching is being applied more and more widely: for example, correcting spelling errors in information retrieval, aligning sequences in computational biology, and correcting transmission errors in signal processing. This is happening not only because the texts to be processed in these two research fields keep growing, but also because the searches that must be performed on them are becoming increasingly complex.
Current research on approximate matching is mainly based on distance models, which require a distance function that measures the similarity of two strings. In approximate matching, a pattern string and a corresponding threshold k are usually given, where k specifies the maximum distance allowed between the pattern string and a match. Approximate matching is defined as follows: under the edit distance model, given an integer k and two strings Pattern[1...m] and Text[1...n], find all substrings p of the text string Text such that edit(Pattern, p) ≤ k. In practice, many problems use the edit distance model or a variant of it.
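The definition above can be illustrated with a short Python sketch. It uses the classical column-wise dynamic programming recurrence in which row 0 stays at 0, so a match may start anywhere in the text. It is a baseline illustration of the problem statement only, not the filtering method of this patent, and the function name is hypothetical.

```python
def approx_end_positions(pattern: str, text: str, k: int) -> list[int]:
    """Return the 1-based end positions j such that some substring of
    `text` ending at j is within edit distance k of `pattern`."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any
    # substring of the text that ends at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]  # value of the cell one row up, one column left
        for i in range(1, m + 1):
            cur = col[i]    # value of the same row in the previous column
            col[i] = min(prev_diag + (pattern[i - 1] != c),  # match / substitute
                         cur + 1,                            # extra text character
                         col[i - 1] + 1)                     # missing pattern character
            prev_diag = cur
        if col[m] <= k:
            ends.append(j)
    return ends
```

For instance, `approx_end_positions("abc", "xxabcxx", 0)` reports only the exact occurrence, while raising k to 1 also reports the neighbouring end positions. This plain recurrence examines every text character for every pattern row, which is the cost that filtering tries to avoid.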
Existing approximate matching methods fall into four classes. The first is based on dynamic programming. The second is based on searching with a nondeterministic finite automaton (NFA). The third is based on bit-parallel techniques and is currently the most successfully applied class. The last is based on filtering: a fast filter algorithm skips the large portions of the text that cannot contain a match, and a non-filtering algorithm then verifies whether the remaining regions contain a match. Methods of this class are also among the fastest.
The algorithm proposed here uses the filtering idea to improve efficiency.
Summary of the invention
The object of the present invention is to provide a method for fast approximate pattern matching and, in particular, an approximate matching algorithm that overcomes the machine word length limitation and is suitable for long pattern strings.
Given a text string and a pattern string, the filter-based fast string matching method of the present invention finds the substrings of the text string that have a certain similarity to the pattern string. The string matching problem is to find all occurrences of a pattern string in a large text string. Using the filtering idea, the invention filters out the regions of the text string that cannot produce a match, which improves the average-case efficiency of the algorithm.
To achieve the above object, the present invention adopts the following technical solution:
A filter-based method for fast string matching comprises the following steps:
(1) Preprocess the pattern string: denote by P0 the prefix of the pattern string of length (k+s)h; cut P0 into k+s pattern blocks of length h; extend each pattern block by k+q-1 characters, so that two consecutive pattern blocks overlap by k+q-1 characters. The k+s extended pattern blocks are denoted Q1, Q2, ..., Q(k+s), where
[Formula image BDA00001803025800021, which defines the block length h, is not reproduced in this text.]
m is the number of characters in the pattern string; k is the minimum number of operations needed to transform the pattern string into a substring of the text string, called the edit distance for short, 0 ≤ k < m; s is the minimum number of text string indexes (explained in step 2) that must exactly match a factor of a pattern block during matching, 1 ≤ s < m; q is the length of a text string index, q ≤ h;
(2) Create the text string indexes: starting from the beginning of the text string, read q characters at intervals of h characters as a text string index; the text string indexes are denoted d1, d2, ..., d(n/h), where n is the length of the text string; q ≤ h;
(3) Create the match array B[d, j]: if a text string index di belongs to Qj, then B[di, j] = 1; otherwise B[di, j] = 0. For each text string index, create an array M of length m' to record its match count; the array for di is denoted Mdi[1...m'] and is initialized to 0, where m' = k+s;
(4) Count the matches between k+s consecutive text string indexes and the pattern blocks: match the text string indexes against the pattern blocks. While the matches are being recorded, if Mdi[j] ≤ j-k occurs (for j > k), stop building this array and go on to build the array Md(i+1) of the next index. If the element Mdi[m'] ≥ s, an approximate match must be checked for, and the method continues with the next step;
(5) Detect approximate matches: if an approximate match may exist, the region to be checked lies between positions j-(k+s)h-2k-q+2 and j+m-(k+s-1)h+k-q of the text string, and it is checked with a dynamic programming algorithm. If an approximate match is detected, all of its occurrence positions are reported; otherwise it is reported that the region contains no approximate match.
A string is a sequence of characters defined over a finite alphabet. The text string is a string, and so is the pattern string. The string matching problem is to find all occurrences of a pattern string in a large text string. Given strings x, y and z, x is called a prefix of xy, a suffix of yx, and a factor of yxz.
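Steps (1) and (2) above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation. Two points are assumptions: the block length is taken as h = floor((m - k - q + 1) / (k + s)), because the formula for h appears only as an image in this text (the assumed form reproduces h = 4 for the parameters of both embodiments and lets the last extended block fit inside the pattern), and the i-th text string index is assumed to be the q characters ending at text position i*h. All function names are hypothetical.

```python
def block_width(m: int, k: int, s: int, q: int) -> int:
    # ASSUMPTION: the source gives this formula only as an image.
    return (m - k - q + 1) // (k + s)


def pattern_blocks(pattern: str, k: int, s: int, q: int) -> list[str]:
    """Cut the prefix P0 into k+s blocks of length h and extend each
    block to the right by k+q-1 characters (blocks Q1 .. Q(k+s))."""
    h = block_width(len(pattern), k, s, q)
    extension = k + q - 1
    return [pattern[j * h:(j + 1) * h + extension] for j in range(k + s)]


def text_samples(text: str, h: int, q: int) -> list[str]:
    """Read q characters every h characters (indexes d1 .. d(n/h)).
    ASSUMPTION: sample i ends at 1-based text position i*h."""
    return [text[i * h - q:i * h] for i in range(1, len(text) // h + 1)]
```

With a 32-character pattern and k = 4, s = 2, q = 2, this yields six blocks of nine characters each, and consecutive blocks share five characters, which agrees with the overlap of k+q-1 stated in step (1).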
Principle of the invention: if the pattern string is divided into k+s pieces, then for any successful approximate match at least s of the pieces match exactly in the text string. Using this idea, the invention builds a q-gram index and counts the exactly matching substrings in the text string to decide whether the region covered by those q-gram indexes may contain an approximate match. A q-gram index is built by sliding a window of length q over the text Text with a fixed sliding distance (h in the present method), which cuts out substrings of length q. To count the matches between q-grams and the pattern string, a match array B[d, j] is introduced: if a q-gram d belongs to Qj, then B[d, j] = 1; otherwise B[d, j] = 0. Thus, for each q-gram d, B[d, *] gives the match information of d against every block.
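The counting idea can be sketched as follows. The match array B is represented here as a dictionary from each q-gram to the set of block numbers j in which it occurs, so that B[d, j] = 1 corresponds to j being in that set. The early exit stops once the matches so far plus the samples still to be read cannot reach s; this is a conservative choice made for the illustration, whereas the source states its stop test as M[j] ≤ j-k for j > k. Function names and the dictionary representation are assumptions.

```python
def build_match_table(blocks: list[str], q: int) -> dict[str, set[int]]:
    """Map every q-gram occurring in block Qj to the set of such j (1-based)."""
    table: dict[str, set[int]] = {}
    for j, block in enumerate(blocks, start=1):
        for p in range(len(block) - q + 1):
            table.setdefault(block[p:p + q], set()).add(j)
    return table


def candidate_windows(samples: list[str], table: dict[str, set[int]],
                      k: int, s: int) -> list[int]:
    """Return the 0-based start of every window of k+s consecutive samples
    in which at least s samples occur in the block at the same rank."""
    width = k + s
    hits = []
    for i in range(len(samples) - width + 1):
        matches = 0
        for j in range(1, width + 1):
            if j in table.get(samples[i + j - 1], ()):
                matches += 1
            # Give up as soon as s matches can no longer be reached.
            if matches + (width - j) < s:
                break
        if matches >= s:
            hits.append(i)
    return hits
```

Only the windows returned by `candidate_windows` need to be passed on to verification; every other part of the text is skipped without running the dynamic programming step.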
Once a region has been found that may contain an approximate match, it must be verified. A widely used class of verification algorithms is based on bit-parallel techniques, such as the BPM algorithm, but these are suited to approximate match verification only when the pattern string is short. The verification algorithm chosen by the invention is a DP algorithm based on the dynamic programming matrix. Its idea is as follows: a pattern string normally has few successful matches in a text, so in the edit distance matrix M the values of each column, computed from top to bottom, quickly reach k+1 (i.e. a mismatch). If the value of a matrix element is greater than k, the search result cannot depend on that element. If the value of a matrix element is less than or equal to k, the element is called active. The method computes each column only up to its last active element; the elements after it need not be computed.
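A minimal Python sketch of this cut-off idea is given below, assuming the usual formulation in which cells known to exceed k are stored as any value above k and each column is evaluated only one row past the last active cell of the previous column. It is an illustration of the technique as described, not the patent's own code, and the function name is hypothetical.

```python
def cutoff_search(pattern: str, text: str, k: int) -> list[int]:
    """Return the 1-based end positions of substrings of `text` within
    edit distance k of `pattern`, computing each column only up to one
    row past the last active cell (a cell whose value is <= k)."""
    m = len(pattern)
    # Cells beyond the last active row always hold some value > k.
    col = [min(i, k + 1) for i in range(m + 1)]
    last = min(k, m)                  # last active row of the current column
    ends = []
    for j, c in enumerate(text, start=1):
        top = min(last + 1, m)        # the active area grows by at most one row
        prev_diag = col[0]            # row 0 is always 0
        for i in range(1, top + 1):
            cur = col[i]
            col[i] = min(prev_diag + (pattern[i - 1] != c),
                         cur + 1,
                         col[i - 1] + 1)
            prev_diag = cur
        last = top
        while col[last] > k:          # col[0] == 0 <= k, so this terminates
            last -= 1
        if last == m:                 # the final row is active: a match ends here
            ends.append(j)
    return ends
```

When matches are rare, the last active row stays close to k, so each column costs roughly k cell updates instead of m, which is why this verifier remains practical for long pattern strings.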
Compared with existing approximate matching algorithms, the present invention has the following notable advantages:
(1) Bit-parallel methods are used heavily in approximate searching, and many good results have been obtained with them. However, bit-parallel algorithms suit short pattern strings, whereas many real searches involve long pattern strings. If the pattern length exceeds the machine word length, matching performance degrades badly. The present algorithm is well suited to approximate matching of long pattern strings, and its efficiency improves as the ratio of the pattern length m to the threshold k increases.
(2) Before the new filtering strategy is added, the time cost of the approximate matching algorithm lies mainly in computing the array M and in verifying candidate regions with the DP algorithm. The added filtering strategy reduces the time spent matching each q-gram index against each pattern block, and thereby shortens the time needed to compute the array M. In the best case, the algorithm halves the matching time of the original algorithm.
(3) A filter algorithm can improve efficiency only in the average case; its advantage is that it may not need to examine every character of the text. The efficiency of a filter algorithm depends on how much text it can filter out and is very sensitive to the error level. Most filter algorithms work well at low error levels and poorly otherwise, whereas this filter algorithm still performs well at higher error levels. Experiments show that, at higher error levels, the algorithm can cut the required matching time by half compared with similar algorithms.
Brief description of the drawings
Fig. 1 is a schematic diagram of building the index over the text string;
Fig. 2 is a schematic diagram of processing the pattern string.
Detailed description of the embodiments
For a better understanding of the invention, it is further described below with reference to embodiments; the scope of protection claimed is not limited to the scope of the embodiments.
Embodiment 1: SMS classification
In recent years China has accelerated the informatization and digitization of society, and people have more and more information to process every day. Application service providers notify customers by SMS to remind them or to let them learn relevant information promptly, for example product recommendations from e-commerce sites, online banking feedback and various payment deduction notices, in addition to users' own chat messages. As people deal with more and more services, these messages become too numerous to manage. Users need a convenient, easy and effective way to classify SMS messages, rather than facing every message directly in the inbox. The filter-based fast string matching method of the present invention can be used to classify messages by template. When the user needs a new message category, only one or more messages of that category need to be provided, and the system can then use approximate matching to classify messages received in the future automatically. The steps are as follows:
(1) The user creates a new message category and provides one or several messages as samples of that category; these samples are the pattern strings of the category. The content of a new message serves as the target text string; if a pattern string of some category matches the text string, the text string belongs to that category.
(2) Suppose the length of the sample message is m = 32 and the length of an SMS text string index is q = 2; the user sets the similarity to k = 4 and the minimum number of message indexes that must exactly match a factor of a pattern block during matching to s = 2; then
[Formula image BDA00001803025800041 is not reproduced in this text; it gives the block length h, which is 4 in this example.]
As shown in Fig. 2, preprocess the pattern string: denote by P0 the prefix of the pattern string of length (k+s)h = 24; cut P0 into k+s = 6 pattern blocks of length h = 4; extend each pattern block by k+q-1 = 5 characters, so that two consecutive pattern blocks overlap by k+q-1 = 5 characters. The 6 extended pattern blocks are denoted Q1, Q2, ..., Q6.
(3) As shown in Fig. 1, suppose the new message has length 128. Create the message indexes: starting from the beginning of the message, read q = 2 characters at intervals of h = 4 characters as a message index; the message indexes are denoted d1, d2, ..., d32.
(4) Create the match array B[d, j]: if a message index di belongs to Qj, then B[di, j] = 1; otherwise B[di, j] = 0. For each message index, create an array M of length m' = 6 to record its match count; the array for di is denoted Mdi[6] and is initialized to 0;
(5) Count the matches between k+s = 6 consecutive message indexes and the pattern blocks: match the message indexes against the pattern blocks. While the matches are being recorded, if Mdi[b] ≤ b-4 occurs (for b > 4), stop building this array and go on to build the array Md(i+1) of the next index. If the element Mdi[6] ≥ 2, an approximate match must be checked for, and the method continues with the next step;
(6) Detect approximate matches: if an approximate match may exist, the region to be checked lies between positions j-32 and j+14 of the message (j is the end position in the message of the last of the 6 consecutive indexes in a given match), and it is checked with a dynamic programming algorithm. If an approximate match is detected, all of its occurrence positions are reported; otherwise it is reported that the region contains no approximate match. A message can belong to several categories; a successful match shows that part of the message meets the template of that category.
Embodiment 2: fuzzy search of diary documents
With the rapid development of computer technology and the quickening pace of life, people now prefer to do text work on a computer rather than keeping diaries and completing documents with pen and paper as before. There are three main reasons: 1) computer technology makes text work easier and faster; 2) computer diaries are more entertaining, for example allowing mood annotations; 3) they meet people's need to record things on the spot. Compared with writing a diary entry or work report at the end of the day, recording in real time better meets people's needs. With the development of mobile phone hardware, handwriting recognition and speech recognition, daily information can be recorded more conveniently and directly, and computer recognition technology converts this multimedia information into text for storage.
For such loosely organized, informal documents, the filter-based fast approximate pattern matching method of the present invention can meet the user's need for fuzzy search over the diary. The steps are as follows:
(1) The user designates the diary document as the text string of this embodiment and enters a query statement as the pattern string of the invention;
(2) The query statement has length m = 16; the user sets the query similarity to k = 2; the minimum number s of diary document indexes that must exactly match a factor of a query block during matching is 1; the length of a diary document index is q = 3; then
[Formula image BDA00001803025800051 is not reproduced in this text; it gives the block length h, which is 4 in this example.]
Preprocess the query statement: denote by P0 the prefix of the query statement of length (k+s)h = 12; cut P0 into k+s = 3 query blocks of length 4; extend each query block by k+q-1 = 4 characters, so that two consecutive query blocks overlap by 4 characters. The 3 extended query blocks are denoted Q1, Q2, Q3.
(3) Create the diary document indexes: suppose the diary has length 1000. Starting from the beginning of the diary document, read 3 characters at intervals of 4 characters as a diary document index; the diary document indexes are denoted d1, d2, ..., d250.
(4) Create the match array B[d, j]: if a diary document index di belongs to Qj, then B[di, j] = 1; otherwise B[di, j] = 0. For each diary document index, create an array M of length 3 to record its match count; the array for di is denoted Mdi[3] and is initialized to 0;
(5) Count the matches between 3 consecutive diary document indexes and the query blocks: match the diary document indexes against the query blocks. While the matches are being recorded, if Mdi[b] ≤ b-2 occurs (for b > 2), stop building this array and go on to build the array Md(i+1) of the next index. If the element Mdi[m'] ≥ s = 1, an approximate match must be checked for, and the method continues with the next step;
(6) Detect approximate matches: if an approximate match may exist, the region to be checked lies between positions j-17 and j+7 of the diary document (j is the end position in the diary document of the last of the 3 consecutive indexes in a given match), and it is checked with a dynamic programming algorithm. If an approximate match is detected, all of its occurrence positions are reported; otherwise it is reported that the region contains no approximate match. Any document that contains a successful match is offered to the user as a candidate document for detailed viewing.
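The region bounds quoted in step (6) of both embodiments follow from the general expression in step (5) of the method, namely j-(k+s)h-2k-q+2 to j+m-(k+s-1)h+k-q. A small Python check is shown below; the function name is hypothetical and h is passed in as given in the embodiments.

```python
def verification_region(j: int, m: int, k: int, s: int, q: int, h: int) -> tuple[int, int]:
    """Return the (first, last) text positions of the region to verify,
    where j is the end position of the last of the k+s matched indexes."""
    first = j - (k + s) * h - 2 * k - q + 2
    last = j + m - (k + s - 1) * h + k - q
    return first, last


# Embodiment 1: m = 32, k = 4, s = 2, q = 2, h = 4  ->  j-32 .. j+14
print(verification_region(100, 32, 4, 2, 2, 4))
# Embodiment 2: m = 16, k = 2, s = 1, q = 3, h = 4  ->  j-17 .. j+7
print(verification_region(100, 16, 2, 1, 3, 4))
```

In both cases the region is 47 and 25 characters long respectively, so the dynamic programming check runs over a window only slightly longer than the pattern itself.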

Claims (3)

1. A filter-based method for fast string matching, characterized by comprising the following steps:
(1) Preprocess the pattern string: denote by P0 the prefix of the pattern string of length (k+s)h; cut P0 into k+s pattern blocks of length h; extend each pattern block by k+q-1 characters, so that two consecutive pattern blocks overlap by k+q-1 characters. The k+s extended pattern blocks are denoted Q1, Q2, ..., Q(k+s), where
[Formula image FDA00001803025700011, which defines the block length h, is not reproduced in this text.]
m is the number of characters in the pattern string; k is the minimum number of operations needed to transform the pattern string into a substring of the text string, called the edit distance for short, 0 ≤ k < m; s is the minimum number of text string indexes that must exactly match a factor of a pattern block during matching, 1 ≤ s < m; q is the length of a text string index, q ≤ h;
(2) Create the text string indexes: starting from the beginning of the text string, read q characters at intervals of h characters as a text string index; the text string indexes are denoted d1, d2, ..., d(n/h); q ≤ h;
(3) Create the match array B[d, j]: if a text string index di belongs to Qj, then the match array B[di, j] = 1; otherwise the match array B[di, j] = 0. For each text string index, create an array M of length m' to record its match count; the array for di is denoted Mdi[1...m'] and is initialized to 0, where m' = k+s;
(4) Count the matches between k+s consecutive text string indexes and the pattern blocks: match the text string indexes against the pattern blocks. While the matches are being recorded, if Mdi[j] ≤ j-k occurs, stop building this array and go on to build the array Md(i+1) of the next index. If the element Mdi[m'] ≥ s, an approximate match must be checked for, and the method continues with the next step;
(5) Detect approximate matches: if an approximate match may exist, the region to be checked lies between positions j-(k+s)h-2k-q+2 and j+m-(k+s-1)h+k-q of the text string, and it is checked with a dynamic programming algorithm. If an approximate match is detected, all of its occurrence positions are reported; otherwise it is reported that the region contains no approximate match.
2. The filter-based method for fast string matching according to claim 1, characterized in that: the text string is a string and the pattern string is a string; a string is a sequence of characters defined over a finite alphabet.
3. The filter-based method for fast string matching according to claim 1, characterized in that: given strings x, y and z, x is called a prefix of xy, a suffix of yx, and a factor of yxz.
CN201210211829.7A 2012-06-25 2012-06-25 Fast character string matching method based on filtering type Expired - Fee Related CN102750379B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210211829.7A CN102750379B (en) 2012-06-25 2012-06-25 Fast character string matching method based on filtering type

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210211829.7A CN102750379B (en) 2012-06-25 2012-06-25 Fast character string matching method based on filtering type

Publications (2)

Publication Number Publication Date
CN102750379A true CN102750379A (en) 2012-10-24
CN102750379B CN102750379B (en) 2014-07-02

Family

ID=47030564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210211829.7A Expired - Fee Related CN102750379B (en) 2012-06-25 2012-06-25 Fast character string matching method based on filtering type

Country Status (1)

Country Link
CN (1) CN102750379B (en)


Also Published As

Publication number Publication date
CN102750379B (en) 2014-07-02


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140702

Termination date: 20210625