CN105653567A - Method for quickly looking for feature character strings in text sequential data - Google Patents
- Publication number
- CN105653567A
- Authority
- CN
- China
- Prior art keywords
- field
- searched
- suffix
- similarity
- text
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for quickly finding feature strings in text sequence data. The method comprises the following steps: (1) obtaining the text sequence, i.e. a character string, from the information; (2) generating suffix arrays; (3) searching the suffix arrays and decomposing the data by binary search. In step (3), each row of the suffix matrix is searched in turn; if a field appears a predetermined number of times in the binary search result sets, the similarity of the two fields is calculated, and the field with the closest similarity is taken as the candidate field. The method makes effective use of the original order of the data in the sequence, so it avoids the cumbersome and slow data analysis that results from the LSH algorithm being limited to unordered data. In addition, pruning is performed directly after the fuzzy query, with candidates filtered by similarity calculation, which overcomes the requirement of similar-segment search algorithms that a subsequence match exactly.
Description
Technical field
The present invention relates to a method for quickly finding feature strings, and in particular to searching massive data for similar text, whether continuous or interrupted.
Background art
Sequence data is now common in practice, arising in bioinformatics, system security, network connections and other areas, and similarity search is a basic technique in sequence data management. Many effective methods already exist for symbolic sequences and time series data such as DNA sequences, stock prices, network packets and video streams. For text search, current approaches fall into two main classes. The first uses locality-sensitive hashing with min-hash (Locality-Sensitive Hashing with Min-Hash, hereinafter abbreviated LSH). The second is similar-segment search based on hash indexes, suffix trees and suffix arrays. Both have limitations on text sequence data. The LSH algorithm is limited to searching unordered data and is slow, yet in practice the order of the data cannot be ignored, because different orders reflect the whole execution process that produced the data. Similar-segment search algorithms are designed around subsequence matching and can filter out results only when a subsequence matches; the design of the present invention does not require subsequences to match exactly.
Summary of the invention
1. Object of the invention.
To solve the problems in the prior art that similarity search over text sequence data is slow and that requiring data sequences to match exactly limits adaptability, the present invention proposes a method for quickly finding feature strings.
2. Technical solution adopted by the invention.
A method for quickly finding feature strings in text sequence data, carried out according to the following steps:
(1) obtain the text sequence, i.e. a character string, from the information;
(2) generate suffix arrays, according to the following steps:
A. choose the above text sequence $S = e_1 \dots e_n$ and a set of mutually independent hash functions $H = \{h_1, \dots, h_n\}$;
B. obtain the hash result sequence $h_i(S)$ of each hash function;
C. $h_i(S) = h_i(e_1) \dots h_i(e_n)$, where the suffix matrix of $S$ is $M_{S,m}$, whose $i$-th row is the suffix array of $h_i(S)$;
(3) in the suffix arrays, first decompose the data by binary search, searching each row of the suffix matrix in turn; if a field appears a predetermined number of times in the binary search result sets, the similarity of the two fields is calculated, and the closest field is taken as the candidate field.
In a further specific embodiment, the LSH algorithm is applied to the fields to screen the candidate fields, the candidate fields being ordered by time: once an event e is regarded as a screened text event, the 3 events that follow it are combined with it to form an event document.
In a further specific embodiment, each field uses different hash functions. The segment is divided into equal regions by character position, each region containing one time point; the hash functions are distributed evenly among the time points and used to calculate hash values. If the calculated signatures are judged consistent and the times similar, the segment is a query result.
3. Beneficial effects of the invention.
(1) The present invention makes effective use of the original order of the data in the sequence, avoiding the cumbersome and slow data analysis caused by the LSH algorithm being limited to unordered data.
(2) The present invention prunes directly after the fuzzy query, filtering candidates by calculating similarity, which solves the problem that similar-segment search algorithms require subsequences to match exactly.
Brief description of the drawings
Fig. 1 is the schematic diagram of the LSH-DOC algorithm of the present invention.
Fig. 2 is the schematic diagram of the LSH-SEP algorithm of the present invention.
Detailed description
To enable the patent examiners, and especially the public, to clearly understand the technical essence and beneficial effects of the present invention, the applicant describes it in detail below by way of an embodiment. The description of the embodiment does not limit the scheme of the invention; any equivalent transformation that is merely formal and not substantive, made in accordance with the inventive concept, shall be regarded as falling within the scope of the technical scheme of the invention.
Embodiment
Definition of the suffix array: given a text sequence $S = e_1 \dots e_n$ and a set of mutually independent hash functions $H = \{h_1, \dots, h_n\}$, let $h_i(S)$ denote the hash result sequence, $h_i(S) = h_i(e_1) \dots h_i(e_n)$. The suffix matrix of $S$ is $M_{S,m}$, whose $i$-th row is the suffix array of $h_i(S)$. There are many ways to form the suffix arrays, so many different suffix matrices can be produced.
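As an illustration only, the definition above might be sketched in Python as follows. The function names `make_hash_functions`, `suffix_array` and `suffix_matrix`, the CRC32-based hash family and the naive suffix sort are all assumptions introduced for this sketch; the patent does not specify how the hash functions or the suffix arrays are built.

```python
import random
import zlib

PRIME = 2_147_483_647  # modulus for the illustrative hash family (assumption)


def make_hash_functions(m, seed=0):
    """Return m deterministic hash functions h(e) = (a * crc32(e) + b) mod PRIME."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(m)]

    def make(a, b):
        return lambda event: (a * zlib.crc32(event.encode("utf-8")) + b) % PRIME

    return [make(a, b) for a, b in params]


def suffix_array(seq):
    """Start positions of all suffixes of seq in lexicographic order (naive sort)."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def suffix_matrix(events, hash_functions):
    """Return (hashed, matrix): hashed[i] is h_i(S), matrix[i] is its suffix array."""
    hashed = [[h(e) for e in events] for h in hash_functions]
    matrix = [suffix_array(hs) for hs in hashed]
    return hashed, matrix
```

Each row of `matrix` lists start positions rather than the suffixes themselves, so the hashed sequence must be kept alongside it for later comparisons. The naive sort is chosen for brevity; a production system would presumably use a faster suffix array construction.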
Suffix array search: this is divided into two steps. First, potentially similar segments are found from the suffix matrix; they are then screened directly by similarity.
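The second step, screening by similarity, could look roughly like the sketch below. The patent does not state which similarity measure is used, so the position-wise agreement ratio here is purely an assumption, as are the names `similarity` and `best_candidate`.

```python
def similarity(a, b):
    """Fraction of aligned positions at which two equal-length segments agree."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_candidate(events, query, candidate_positions):
    """Return (position, similarity) of the candidate segment closest to the query."""
    best_pos, best_sim = None, -1.0
    for pos in candidate_positions:
        segment = events[pos:pos + len(query)]
        sim = similarity(query, segment)
        if sim > best_sim:
            best_pos, best_sim = pos, sim
    return best_pos, best_sim
```

A segment that differs from the query in one position out of four scores 0.75 under this measure, so it can still be returned even though it is not an exact subsequence match.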
Given a set of mutually independent hash functions and a query sequence, the suffix matrix is generated. The data is then decomposed by binary search, searching each row of the suffix matrix in turn. If a field has appeared a predetermined number of times in the binary search result sets, it is taken as a candidate field.
The candidate-field algorithm uses the following notation. h(i) denotes the i-th hash function in the hash function set H; Qhi denotes the hash result of the query sequence processed by h(i); SAi is the i-th row of the suffix matrix $M_{S,m}$, and SAi[j] is the j-th element of the i-th row. CompareAt(Qhi, SAi[j]) performs the comparison used by the binary search: it returns 1 if the first argument is greater, -1 if the second argument is greater, and 0 otherwise. Extract(Qhi, SAi, pos) is the function that extracts candidate fields. After the algorithm of Fig. 1 has run, the results that occur r times are extracted; r is the predetermined number of times mentioned above.
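The original listing is not reproduced in this text. The sketch below is one possible reading of the description, not the patented procedure itself: `compare_at` follows the stated return convention of CompareAt, `search_row` plays the role of the binary search plus Extract for one row, and `candidate_positions` keeps positions found in at least r rows. All Python names, and the choice to treat a hit as an exact match of the hashed query, are assumptions.

```python
from collections import Counter


def suffix_array(seq):
    """Start positions of all suffixes of seq in lexicographic order (naive sort)."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def compare_at(qh, hs, start):
    """Return 1 if the hashed query is greater than the suffix prefix, -1 if smaller, else 0."""
    window = hs[start:start + len(qh)]
    if qh > window:
        return 1
    if qh < window:
        return -1
    return 0


def search_row(qh, hs, sa):
    """Binary-search one suffix array for positions where hs matches qh."""
    lo, hi = 0, len(sa)
    while lo < hi:  # find the first suffix that is not smaller than the query
        mid = (lo + hi) // 2
        if compare_at(qh, hs, sa[mid]) > 0:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:  # find the first suffix that is greater than the query
        mid = (lo + hi) // 2
        if compare_at(qh, hs, sa[mid]) >= 0:
            lo = mid + 1
        else:
            hi = mid
    return sa[first:lo]


def candidate_positions(query_rows, hashed_rows, suffix_rows, r):
    """Positions matched in at least r rows of the suffix matrix."""
    counts = Counter()
    for qh, hs, sa in zip(query_rows, hashed_rows, suffix_rows):
        counts.update(search_row(qh, hs, sa))
    return sorted(pos for pos, count in counts.items() if count >= r)
```

Two binary searches per row delimit the block of suffixes that begin with the hashed query, which works because truncating suffixes to the query length preserves their sorted order.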
On this basis the LSH algorithm is applied again to screen the candidate fields directly. Fig. 1 and Fig. 2 are schematic diagrams of two LSH implementations. Referring to Fig. 1, each candidate segment is regarded as a sequence of connected "time documents": once an event e is regarded as a text event, the 3 events after it are combined with it to form one "time document", such as Li+1, Li+2 and so on.
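A minimal sketch of the Fig. 1 scheme (LSH-DOC), under stated assumptions: each event is grouped with the 3 events after it, and a min-hash is taken over that document. The salted CRC32 hash, the `salt` parameter used to distinguish hash functions, and the names `event_documents`, `min_hash` and `lsh_doc_row` are hypothetical and not taken from the patent.

```python
import zlib


def event_documents(events, following=3):
    """Document i is event i together with up to `following` later events."""
    return [events[i:i + following + 1] for i in range(len(events))]


def min_hash(document, salt):
    """Smallest salted CRC32 over the distinct events of a document."""
    return min(zlib.crc32(f"{salt}:{e}".encode("utf-8")) for e in set(document))


def lsh_doc_row(events, salt, following=3):
    """Hash result sequence for one hash function: one min-hash per document."""
    return [min_hash(doc, salt) for doc in event_documents(events, following)]
```

Because the min-hash is computed over a set, the order of events inside a single document does not affect its hash value; order is only preserved between consecutive documents.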
Referring to Fig. 2, in order to preserve time information, different hash functions are assigned to the individual fields of each segment. For example, if the length of each segment is 4 and there are 40 hash functions, 10 functions are assigned to each time point. Each hash function is then used to index its field of the segment. Fig. 2 shows a sequence S and several segments Li+1, Li+2 and so on, where p1 to p4 are the 4 regions of each segment. Each region contains one time point, and each pj has 10 hash functions to calculate hash values. If the signatures of two segments are consistent, the time of each field is also similar, so the order is preserved.
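The worked example above (segment length 4, 40 hash functions, 10 per position) could be sketched as follows. This is an illustrative reading of Fig. 2 (LSH-SEP) rather than the patent's own code; the CRC32-based `position_hash` and the names `segments` and `lsh_sep_signature` are assumptions.

```python
import zlib

SEGMENT_LEN = 4                         # length of each segment, as in the example
NUM_HASHES = 40                         # total number of hash functions
PER_POSITION = NUM_HASHES // SEGMENT_LEN  # 10 hash functions per position p_j


def position_hash(event, position, k):
    """k-th illustrative hash function dedicated to one position of a segment."""
    return zlib.crc32(f"{position}:{k}:{event}".encode("utf-8"))


def segments(events):
    """All contiguous segments of SEGMENT_LEN events."""
    return [events[i:i + SEGMENT_LEN] for i in range(len(events) - SEGMENT_LEN + 1)]


def lsh_sep_signature(segment):
    """Signature of 40 values: 10 hashes for each of the 4 positions, in order."""
    if len(segment) != SEGMENT_LEN:
        raise ValueError("segment must contain exactly SEGMENT_LEN events")
    return tuple(
        position_hash(event, position, k)
        for position, event in enumerate(segment)
        for k in range(PER_POSITION)
    )
```

Since every hash function is tied to one position, swapping two events inside a segment changes the signature, which is how this variant keeps the time order that a plain set-based min-hash would discard.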
The above embodiment does not limit the present invention in any way; every technical scheme obtained by equivalent replacement or equivalent transformation falls within the protection scope of the present invention.
Claims (3)
1. A method for quickly finding feature strings in text sequence data, characterized in that it is carried out according to the following steps:
(1) obtain the text sequence, i.e. a character string, from the information;
(2) generate suffix arrays, according to the following steps:
A. choose the above text sequence $S = e_1 \dots e_n$ and a set of mutually independent hash functions $H = \{h_1, \dots, h_n\}$;
B. obtain the hash result sequence $h_i(S)$ of each hash function;
C. $h_i(S) = h_i(e_1) \dots h_i(e_n)$, where the suffix matrix of $S$ is $M_{S,m}$, whose $i$-th row is the suffix array of $h_i(S)$;
(3) in the suffix arrays, first decompose the data by binary search, searching each row of the suffix matrix in turn; if a field appears a predetermined number of times in the binary search result sets, the similarity of the two fields is calculated, and the closest field is taken as the candidate field.
2. The method for quickly finding feature strings in text sequence data according to claim 1, characterized in that: the LSH algorithm is applied to the fields to screen the candidate fields, the candidate fields being ordered by time; once an event e is regarded as a screened text event, the 3 events that follow it are combined with it to form an event document.
3. The method for quickly finding feature strings in text sequence data according to claim 2, characterized in that: each field uses different hash functions; the segment is divided into equal regions by character position, each region containing one time point; the hash functions are distributed evenly among the time points and used to calculate hash values; if the calculated signatures are judged consistent and the times similar, the segment is a query result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410725893.6A CN105653567A (en) | 2014-12-04 | 2014-12-04 | Method for quickly looking for feature character strings in text sequential data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410725893.6A CN105653567A (en) | 2014-12-04 | 2014-12-04 | Method for quickly looking for feature character strings in text sequential data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105653567A | 2016-06-08 |
Family
ID=56480625
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410725893.6A Pending CN105653567A (en) | 2014-12-04 | 2014-12-04 | Method for quickly looking for feature character strings in text sequential data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105653567A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107038230A (en) * | 2017-04-07 | 2017-08-11 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | A kind of short message searching method and system based on Suffix array clustering |
CN107193977A (en) * | 2017-05-26 | 2017-09-22 | 刘伟 | Data search method, artificial intelligence system, image processing system, database, search engine, communication system, computer application |
CN108920483A (en) * | 2018-04-28 | 2018-11-30 | 南京搜文信息技术有限公司 | Character string fast matching method based on Suffix array clustering |
CN111538768A (en) * | 2020-06-23 | 2020-08-14 | 平安国际智慧城市科技股份有限公司 | Data query method and device based on N-element model, electronic equipment and medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060179052A1 (en) * | 2003-03-03 | 2006-08-10 | Pauws Steffen C | Method and arrangement for searching for strings |
CN102073740A (en) * | 2011-01-27 | 2011-05-25 | 农革 | String suffix array construction method on basis of radix sorting |
CN103399907A (en) * | 2013-07-31 | 2013-11-20 | 深圳市华傲数据技术有限公司 | Method and device for calculating similarity of Chinese character strings on the basis of edit distance |
CN103810228A (en) * | 2012-11-01 | 2014-05-21 | 辉达公司 | System, method, and computer program product for parallel reconstruction of a sampled suffix array |
CN103902599A (en) * | 2012-12-27 | 2014-07-02 | 北京新媒传信科技有限公司 | Fuzzy search method and fuzzy search device |
Non-Patent Citations (1)
Title |
---|
Liang Tang et al.: "Searching Similar Segments over Textual Event Sequences", in Proceedings of the 22nd ACM Conference on Information and Knowledge Management (CIKM 2013) * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103729402B (en) | Method for establishing mapping knowledge domain based on book catalogue | |
CN101593200B (en) | Method for classifying Chinese webpages based on keyword frequency analysis | |
CN107291895B (en) | Quick hierarchical document query method | |
CN110597870A (en) | Enterprise relation mining method | |
CN109325019B (en) | Data association relationship network construction method | |
CN112241481A (en) | Cross-modal news event classification method and system based on graph neural network | |
CN103310003A (en) | Method and system for predicting click rate of new advertisement based on click log | |
CN103761236A (en) | Incremental frequent pattern increase data mining method | |
CN102646095B (en) | Object classifying method and system based on webpage classification information | |
CN105653567A (en) | Method for quickly looking for feature character strings in text sequential data | |
CN104598536B (en) | A kind of distributed network information structuring processing method | |
CN108021667A (en) | A kind of file classification method and device | |
Chu et al. | Automatic data extraction of websites using data path matching and alignment | |
CN106202007B (en) | A kind of appraisal procedure of MATLAB program files similarity | |
Dias et al. | A method for the identification of collaboration in large scientific databases | |
CN113705099A (en) | Social platform rumor detection model construction method and detection method based on contrast learning | |
Gupta et al. | A classification method to classify high dimensional data | |
CN103218420A (en) | Method and device for extracting page titles | |
CN109359090A (en) | File fragmentation classification method and system based on convolutional neural networks | |
Kamanwar et al. | Web data extraction techniques: A review | |
CN104699666B (en) | Based on neighbour's propagation model from the method for library catalogue learning hierarchical structure | |
CN109684460A (en) | A kind of calculation method and system of the negative network public-opinion index based on deep learning | |
Raj et al. | Detection of Botnet Using Deep Learning Architecture Using Chrome 23 Pattern with IOT | |
CN116579344B (en) | Case main body extraction method | |
John et al. | Methods for removing noise from web pages: a review |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20160608 |