CN105653567A - Method for quickly looking for feature character strings in text sequential data - Google Patents

Info

Publication number
CN105653567A
Authority
CN
China
Prior art keywords
field
searched
suffix
similarity
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410725893.6A
Other languages
Chinese (zh)
Inventor
李涛
张晟骁
李千目
侯君
徐建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology Changshu Research Institute Co Ltd
Original Assignee
Nanjing University of Science and Technology Changshu Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology Changshu Research Institute Co Ltd filed Critical Nanjing University of Science and Technology Changshu Research Institute Co Ltd
Priority to CN201410725893.6A priority Critical patent/CN105653567A/en
Publication of CN105653567A publication Critical patent/CN105653567A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for quickly finding feature character strings in text sequence data. The method comprises the following steps: (1) acquiring a text sequence, i.e. a character string, from the information; (2) generating a suffix array; (3) searching the suffix array, decomposing the data by binary search. In step (3), each row of the suffix matrix is searched in turn; if a field occurs a predetermined number of times in the binary-search result set, the similarity of the two fields is calculated, and the field with the closest similarity is taken as the candidate field. The method makes effective use of the original order of the data in the sequence, and so avoids the cumbersome, slow data analysis that results from the LSH algorithm being limited to unordered data. In addition, candidates can be pruned directly after the fuzzy query: candidate segments are filtered directly by similarity calculation, which removes the requirement of similar-segment search algorithms that subsequences match completely.

Description

A method for quickly finding feature character strings in text sequence data
Technical field
The present invention relates to a method for quickly finding feature character strings, and in particular to finding continuous or discontinuous similar text in massive data.
Background art
Sequence data is now very common in practice, in fields including bioinformatics, system security and network connections. Similarity search is also a basic technique in sequence data management. Many effective methods already exist for symbolic sequences and time-series data such as DNA sequences, stock data, network packets and video streams. For text search, current approaches fall into two main classes: one uses locality-sensitive hashing with min-hash (Locality-Sensitive Hashing with Min-Hash, hereinafter LSH); the other is similar-segment search based on hash indexes, suffix trees and suffix arrays. Both have limitations on text sequence data. The LSH algorithm is limited to searching unordered data and is slow, yet in practice the order of the data cannot be ignored, because different orders reflect the whole execution process that produced the data. Similar-segment search algorithms are designed around subsequence matching and can only filter out results once a subsequence matches; the design of the present invention does not require subsequences to match completely.
Summary of the invention
1. Object of the invention.
To solve the problems in the prior art that similarity search over text sequence data is slow and that data sequences must match completely, which limits adaptability, the present invention proposes a method for quickly finding feature character strings.
2. Technical solution adopted by the invention.
A method for quickly finding feature character strings in text sequence data, carried out in the following steps:
(1) acquire the text sequence, i.e. a character string, from the information;
(2) generate the suffix array, in the following steps:
A. take the above text sequence S = e_1 … e_n and a set of mutually independent hash functions H = {h_1, …, h_n};
B. obtain the sequence of hash results of each hash function, h_i(S);
C. h_i(S) = h_i(e_1) … h_i(e_n); the suffix matrix of S is M_{S,m}, whose i-th row SA_i is the suffix array of h_i(S);
(3) in the suffix array, first decompose the data by binary search, searching each row in turn according to the number of rows of the suffix matrix; if a field occurs a predetermined number of times in the binary-search result set, calculate the similarity of the two fields, and take the closest field as the candidate field.
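The final filtering in step (3) can be illustrated with a minimal sketch. The text does not name a similarity measure, so the Jaccard similarity over whitespace-separated tokens used here is an assumption, as are the helper names `jaccard`, `segment_similarity` and `closest_segment`; none of them come from the patent.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the token sets of two events (assumed measure)."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def segment_similarity(query, segment) -> float:
    """Mean event-by-event similarity of two equally long segments."""
    if len(query) != len(segment) or not query:
        return 0.0
    return sum(jaccard(q, s) for q, s in zip(query, segment)) / len(query)


def closest_segment(query, events, candidate_starts):
    """Return (start, similarity) of the candidate segment closest to the query.

    candidate_starts must be non-empty; each entry is the index in `events`
    at which a candidate segment of len(query) events begins.
    """
    scored = [
        (start, segment_similarity(query, events[start:start + len(query)]))
        for start in candidate_starts
    ]
    return max(scored, key=lambda item: item[1])
```

Given candidate start positions produced by the row-by-row search, `closest_segment` keeps only the segment most similar to the query, so a candidate need not match the query exactly to be returned.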
In a further specific embodiment, the LSH algorithm is applied to the fields to screen the candidate fields. The candidate fields are ordered by time; once an event e is taken as a screened text event, the 3 events after it and the event before it are combined to form an event document.
In a further specific embodiment, each field uses different hash functions. The segment is divided into equal regions by character section, each region containing one time point, and the hash functions are distributed evenly across the time points to compute hash values. If the computed signatures are identical, the times are judged to be similar and the segment is a query result.
3. Beneficial effects of the invention.
(1) The invention makes effective use of the original order of the data in the sequence, avoiding the cumbersome and slow data analysis caused by the LSH algorithm being limited to unordered data.
(2) After the fuzzy query the invention prunes candidates directly, filtering candidate segments by calculating similarity, which solves the problem that similar-segment search algorithms require subsequences to match completely.
Brief description of the drawings
Fig. 1 is the schematic diagram of the LSH-DOC algorithm of the present invention.
Fig. 2 is the schematic diagram of the LSH-SEP algorithm of the present invention.
Detailed description
To enable patent examiners, and especially the public, to understand the technical substance and beneficial effects of the invention more clearly, the applicant describes it in detail below by way of an embodiment. The description of the embodiment does not limit the scheme of the invention; any equivalent transformation made according to the inventive concept that is merely formal and not substantive shall be regarded as falling within the technical scheme of the invention.
Embodiment
Definition of the suffix array: given a text sequence S = e_1 … e_n and a set of mutually independent hash functions H = {h_1, …, h_n}, let h_i(S) denote the sequence of hash results, h_i(S) = h_i(e_1) … h_i(e_n). The suffix matrix of S is M_{S,m}, whose i-th row SA_i is the suffix array of h_i(S). There are many ways to build a suffix array, so many different suffix matrices can be produced.
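The definition can be made concrete with a minimal sketch that hashes every event with each hash function and builds one suffix array per hashed sequence. The seeded CRC-32 hash, the naive sort-based construction and the names `make_hash`, `suffix_array` and `suffix_matrix` are assumptions chosen for brevity; the patent does not prescribe a particular hash family or construction method.

```python
import zlib


def make_hash(seed: int):
    """Return a deterministic hash function numbered `seed` (illustrative only)."""
    def h(event: str) -> int:
        return zlib.crc32(f"{seed}:{event}".encode("utf-8"))
    return h


def suffix_array(seq):
    """Start indices of the suffixes of seq in lexicographic order.

    Naive construction by sorting the suffixes; faster constructions exist.
    """
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def suffix_matrix(events, hash_funcs):
    """Return (hashed, matrix): hashed[i] is h_i(S), matrix[i] its suffix array."""
    hashed = [[h(e) for e in events] for h in hash_funcs]
    matrix = [suffix_array(row) for row in hashed]
    return hashed, matrix
```

Each row of `matrix` is a permutation of the positions 0 … n-1, so the matrix has one row per hash function and one column per event.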
Suffix array search: this has two steps. First, potentially similar segments are found from the suffix matrix; then they are screened directly by similarity.
Given a set of mutually independent hash functions and a query sequence, the suffix matrix is generated. The data are then decomposed by binary search, searching each row in turn according to the number of rows of the suffix matrix. If a field has occurred a predetermined number of times in the binary-search result set, that field is regarded as a candidate field.
The candidate-field algorithm works as follows. h(i) denotes the i-th hash function in the hash function set H; Q_hi denotes the hash result of the query sequence processed by h(i); SA_i is the i-th row of the suffix matrix M_{S,m}, and SA_i[j] is the j-th element of that row. CompareAt(Q_hi, SA_i[j]) performs the binary-search comparison of its two arguments: it returns 1 if the first argument is larger, -1 if the second is larger, and 0 otherwise. Extract(Q_hi, SA_i, pos) is the function used to extract candidate fields. After the algorithm of Fig. 1 has run, the fields that occurred r times are extracted, r being the predetermined number of times mentioned above.
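The row-by-row search can be sketched as follows. This is a simplified illustration under stated assumptions, not the listing from the patent: the hash functions are taken to be min-hashes over whitespace-separated tokens (seeded CRC-32), `compare_at` and `extract` are loose analogues of CompareAt and Extract with different signatures, and a field is counted as matched in a row only when its hashed window equals the hashed query.

```python
import zlib
from collections import Counter


def min_hash(seed: int, event: str) -> int:
    """Min-hash of an event's tokens under the hash function numbered `seed`."""
    return min(zlib.crc32(f"{seed}:{tok}".encode("utf-8")) for tok in event.split())


def compare_at(q, hashed, start):
    """Compare q with the equally long window of `hashed` beginning at `start`.

    Returns 1 if q is larger, -1 if the window is larger, 0 if they are equal.
    """
    window = hashed[start:start + len(q)]
    if q > window:
        return 1
    if q < window:
        return -1
    return 0


def extract(q, hashed, sa):
    """Start positions whose window equals q, found by binary search on `sa`."""
    lo, hi = 0, len(sa)
    while lo < hi:  # find the first suffix that is not smaller than q
        mid = (lo + hi) // 2
        if compare_at(q, hashed, sa[mid]) > 0:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and compare_at(q, hashed, sa[lo]) == 0:
        hits.append(sa[lo])
        lo += 1
    return hits


def candidate_fields(query, events, num_hashes, r):
    """Start positions matched in at least r of the num_hashes rows."""
    votes = Counter()
    for seed in range(num_hashes):
        hashed = [min_hash(seed, e) for e in events]
        sa = sorted(range(len(hashed)), key=lambda i: hashed[i:])  # row SA_i
        q = [min_hash(seed, e) for e in query]
        votes.update(extract(q, hashed, sa))
    return sorted(pos for pos, count in votes.items() if count >= r)
```

Because min-hash values of similar events collide in some rows but not necessarily all, lowering `r` below `num_hashes` lets segments that only approximately match the query survive as candidates.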
On this basis the LSH algorithm is applied again to screen the candidate fields directly. Figs. 1 and 2 are schematic diagrams of two LSH implementations. Referring to Fig. 1, each candidate segment is treated as a "time document" formed by concatenation: once an event e is taken as the text event, the 3 events after it are combined with it to form one "time document", such as L_i+1, L_i+2 and so on.
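A minimal sketch of this document-style scheme (LSH-DOC in Fig. 1) follows. It assumes that a "time document" is the bag of whitespace-separated tokens of an event and the 3 events after it, and that signatures are min-hash values from seeded CRC-32; the function names and the choice of 16 hash functions are illustrative, not taken from the patent.

```python
import zlib


def event_documents(events, width=4):
    """Token set of every window of `width` events (event e plus the 3 after it)."""
    docs = []
    for i in range(len(events) - width + 1):
        tokens = set()
        for event in events[i:i + width]:
            tokens.update(event.split())
        docs.append(tokens)
    return docs


def minhash_signature(tokens, num_hashes=16):
    """Min-hash signature of a non-empty token set."""
    return tuple(
        min(zlib.crc32(f"{seed}:{tok}".encode("utf-8")) for tok in tokens)
        for seed in range(num_hashes)
    )


def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions on which two documents agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

In this sketch the events inside a window are merged into one token set, so two windows containing the same events in a different order receive the same signature.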
Referring to Fig. 2, in order to preserve time information, different hash functions are assigned to each independent field of a segment. For example, if each independent segment has length 4 and there are 40 hash functions, 10 functions are assigned to each time point. Each hash function is then used to index its own field of the segment. Fig. 2 shows a sequence S and several segments L_i+1, L_i+2 and so on, where p1 to p4 are the 4 regions of each segment. Each region contains one time point, and each p_j has 10 hash functions to compute hash values. If the signatures of two segments are identical, then the times of their fields are also similar, and so the order is preserved.
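The position-wise assignment of hash functions (LSH-SEP in Fig. 2) can be sketched with the numbers from the example: segments of 4 events, 40 hash functions, 10 per position. The seeded CRC-32 min-hash and the names `min_hash` and `sep_signature` are assumptions made for illustration.

```python
import zlib

SEGMENT_LENGTH = 4                            # events per segment (p1 to p4)
NUM_HASHES = 40                               # total number of hash functions
PER_POSITION = NUM_HASHES // SEGMENT_LENGTH   # 10 functions for each position


def min_hash(seed: int, event: str) -> int:
    """Min-hash of an event's tokens under the hash function numbered `seed`."""
    return min(zlib.crc32(f"{seed}:{tok}".encode("utf-8")) for tok in event.split())


def sep_signature(segment):
    """Signature in which position j is hashed only by its own 10 functions."""
    if len(segment) != SEGMENT_LENGTH:
        raise ValueError("segment must contain exactly 4 events")
    signature = []
    for position, event in enumerate(segment):
        first = position * PER_POSITION
        for seed in range(first, first + PER_POSITION):
            signature.append(min_hash(seed, event))
    return tuple(signature)
```

Because each position has its own block of hash functions, two segments can only share a signature if their events agree position by position, which is how the order of events is retained.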
The above embodiment does not limit the invention in any way; all technical schemes obtained by equivalent replacement or equivalent transformation fall within the scope of protection of the invention.

Claims (3)

1. A method for quickly finding feature character strings in text sequence data, characterized in that it is carried out in the following steps:
(1) acquire the text sequence, i.e. a character string, from the information;
(2) generate the suffix array, in the following steps:
A. take the above text sequence S = e_1 … e_n and a set of mutually independent hash functions H = {h_1, …, h_n};
B. obtain the sequence of hash results of each hash function, h_i(S);
C. h_i(S) = h_i(e_1) … h_i(e_n); the suffix matrix of S is M_{S,m}, whose i-th row SA_i is the suffix array of h_i(S);
(3) in the suffix array, first decompose the data by binary search, searching each row in turn according to the number of rows of the suffix matrix; if a field occurs a predetermined number of times in the binary-search result set, calculate the similarity of the two fields, and take the closest field as the candidate field.
2. The method for quickly finding feature character strings in text sequence data according to claim 1, characterized in that: the LSH algorithm is applied to said fields to screen the candidate fields, the candidate fields being ordered by time; once an event e is taken as a screened text event, the 3 events after it and the event before it are combined to form an event document.
3. The method for quickly finding feature character strings in text sequence data according to claim 2, characterized in that: each field uses different hash functions; the segment is divided into equal regions by character section, each region containing one time point; the hash functions are distributed evenly across the time points to compute hash values; if the computed signatures are identical, the times are judged to be similar and the segment is a query result.
CN201410725893.6A 2014-12-04 2014-12-04 Method for quickly looking for feature character strings in text sequential data Pending CN105653567A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410725893.6A CN105653567A (en) 2014-12-04 2014-12-04 Method for quickly looking for feature character strings in text sequential data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410725893.6A CN105653567A (en) 2014-12-04 2014-12-04 Method for quickly looking for feature character strings in text sequential data

Publications (1)

Publication Number Publication Date
CN105653567A true CN105653567A (en) 2016-06-08

Family

ID=56480625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410725893.6A Pending CN105653567A (en) 2014-12-04 2014-12-04 Method for quickly looking for feature character strings in text sequential data

Country Status (1)

Country Link
CN (1) CN105653567A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038230A (en) * 2017-04-07 2017-08-11 广东顺德中山大学卡内基梅隆大学国际联合研究院 A kind of short message searching method and system based on Suffix array clustering
CN107193977A (en) * 2017-05-26 2017-09-22 刘伟 Data search method, artificial intelligence system, image processing system, database, search engine, communication system, computer application
CN108920483A (en) * 2018-04-28 2018-11-30 南京搜文信息技术有限公司 Character string fast matching method based on Suffix array clustering
CN111538768A (en) * 2020-06-23 2020-08-14 平安国际智慧城市科技股份有限公司 Data query method and device based on N-element model, electronic equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060179052A1 (en) * 2003-03-03 2006-08-10 Pauws Steffen C Method and arrangement for searching for strings
CN102073740A (en) * 2011-01-27 2011-05-25 农革 String suffix array construction method on basis of radix sorting
CN103399907A (en) * 2013-07-31 2013-11-20 深圳市华傲数据技术有限公司 Method and device for calculating similarity of Chinese character strings on the basis of edit distance
CN103810228A (en) * 2012-11-01 2014-05-21 辉达公司 System, method, and computer program product for parallel reconstruction of a sampled suffix array
CN103902599A (en) * 2012-12-27 2014-07-02 北京新媒传信科技有限公司 Fuzzy search method and fuzzy search device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liang Tang et al.: "Searching Similar Segments over Textual Event Sequences", in Proceedings of the 22nd ACM Conference on Information and Knowledge Management (CIKM 2013) *

Similar Documents

Publication Publication Date Title
CN103729402B (en) Method for establishing mapping knowledge domain based on book catalogue
CN101593200B (en) Method for classifying Chinese webpages based on keyword frequency analysis
CN107291895B (en) Quick hierarchical document query method
CN110597870A (en) Enterprise relation mining method
CN109325019B (en) Data association relationship network construction method
CN112241481A (en) Cross-modal news event classification method and system based on graph neural network
CN103310003A (en) Method and system for predicting click rate of new advertisement based on click log
CN103761236A (en) Incremental frequent pattern increase data mining method
CN102646095B (en) Object classifying method and system based on webpage classification information
CN105653567A (en) Method for quickly looking for feature character strings in text sequential data
CN104598536B (en) A kind of distributed network information structuring processing method
CN108021667A (en) A kind of file classification method and device
Chu et al. Automatic data extraction of websites using data path matching and alignment
CN106202007B (en) A kind of appraisal procedure of MATLAB program files similarity
Dias et al. A method for the identification of collaboration in large scientific databases
CN113705099A (en) Social platform rumor detection model construction method and detection method based on contrast learning
Gupta et al. A classification method to classify high dimensional data
CN103218420A (en) Method and device for extracting page titles
CN109359090A (en) File fragmentation classification method and system based on convolutional neural networks
Kamanwar et al. Web data extraction techniques: A review
CN104699666B (en) Based on neighbour's propagation model from the method for library catalogue learning hierarchical structure
CN109684460A (en) A kind of calculation method and system of the negative network public-opinion index based on deep learning
Raj et al. Detection of Botnet Using Deep Learning Architecture Using Chrome 23 Pattern with IOT
CN116579344B (en) Case main body extraction method
John et al. Methods for removing noise from web pages: a review

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160608