CN105653567A

CN105653567A - Method for quickly looking for feature character strings in text sequential data

Info

Publication number: CN105653567A
Application number: CN201410725893.6A
Authority: CN
Inventors: 李涛; 张晟骁; 李千目; 侯君; 徐建
Original assignee: Nanjing University of Science and Technology Changshu Research Institute Co Ltd
Current assignee: Nanjing University of Science and Technology Changshu Research Institute Co Ltd
Priority date: 2014-12-04
Filing date: 2014-12-04
Publication date: 2016-06-08

Abstract

The invention discloses a method for quickly searching for feature character strings in text sequence data. The method comprises the following steps: (1) acquiring the text sequence in the information, namely a character string; (2) generating a suffix array; (3) searching the suffix array and decomposing the data by binary search. In step (3), each row of the suffix matrix is searched in turn; if a field occurs a designated number of times in the binary search result set, the similarity of the two fields is calculated, and the field with the closest similarity is taken as the candidate field. The method makes effective use of the ordering of the original data in the sequence, and so avoids the complicated analysis and slow speed that result from the LSH algorithm being limited to unordered data. In addition, candidates can be pruned directly after the fuzzy query and filtered directly by similarity calculation, which removes the requirement of similar-segment search algorithms that a subsequence be fully matched.

Description

A method for quickly searching for feature strings in text sequence data

Technical field

The present invention relates to a method for quickly searching for feature strings, and in particular to searching for continuous or intermittent similar text in massive data.

Background art

Sequence data is now common in practice, in fields including bioinformatics, system security and network connections. Similarity search is likewise a basic technique in sequence data management. Many effective methods already exist for symbol sequences and time-series data such as DNA sequences, stock data, network data packets and video streams. For text search, current approaches fall into two classes. The first is locality-sensitive hashing with min-hash (Locality-Sensitive Hashing with Min-Hash, hereinafter LSH). The second is similar-segment search based on hash indexes, suffix trees and suffix arrays. Both are limited when applied to text sequence data. The LSH algorithm is confined to searching unordered data and is slow, yet in practical use the ordering of the data cannot be ignored, because different orders reflect the whole execution process that produced the data. Similar-segment search algorithms are designed around subsequence matching and can filter out results only after a subsequence is matched, whereas the design of the present invention does not require the subsequence to match completely.

Summary of the invention

1. Object of the present invention.

The present invention addresses the problems in the prior art that similarity search over text sequence data is slow and that data sequences must match completely, which limits adaptability, and proposes a method for quickly searching for feature strings.

2. Technical solution adopted by the present invention.

A method for quickly searching for feature strings in text sequence data, carried out according to the following steps:

(1) obtaining the text sequence in the information, i.e. a character string;

(2) generating a suffix array, according to the following steps:

A. choosing the above text sequence S = e₁ … e_n and a set of mutually independent hash functions H = {h₁, …, h_n};

B. obtaining the hash result sequence h_i(S) of each hash function;

C. h_i(S) = h_i(e₁) … h_i(e_n), where the suffix matrix of S is M_s,m, whose i-th row SA_i is the suffix array of h_i(S);

(3) in the suffix array, first decomposing the data by binary search: each row of the suffix matrix is searched in turn, and if a field occurs a predetermined number of times in the binary search result set, the similarity of the two fields is calculated and the closest field is taken as the candidate field.

In a further specific embodiment, the LSH algorithm is applied to the fields to screen the candidate fields. The candidate fields are ordered by time; once an event e is regarded as a screened text event, the 3 events that follow it are combined with that preceding event to form an event document.

In a further specific embodiment, each field uses a different hash function. The segment is divided into equal regions according to its characters, each region containing one time point, and the hash functions are distributed evenly among the time points to compute hash values. If the computed signatures are found to be identical, the times are similar and the segment is a query result.

3. Beneficial effects of the present invention.

(1) The present invention makes effective use of the ordering of the original data in the sequence, avoiding the cumbersome data analysis and slow speed caused by the LSH algorithm being limited to unordered data.

(2) The present invention prunes candidates directly after the fuzzy query and filters candidates directly by computing similarity, solving the problem that similar-segment search algorithms require subsequences to match completely.

Brief description of the drawings

Fig. 1 is the schematic diagram of the LSH-DOC algorithm of the present invention.

Fig. 2 is the schematic diagram of the LSH-SEP algorithm of the present invention.

Embodiment

To enable patent examiners, and especially the public, to understand clearly the technical substance and beneficial effects of the present invention, the applicant describes it in detail below by way of embodiments. The description of the embodiments does not limit the scheme of the present invention; any equivalent transformation made according to the inventive concept that is merely formal and not substantive shall be regarded as falling within the technical scheme of the present invention.

Embodiment

Definition (suffix array): given a text sequence S = e₁ … e_n and a set of mutually independent hash functions H = {h₁, …, h_n}, let h_i(S) denote the hash result sequence, h_i(S) = h_i(e₁) … h_i(e_n). The suffix matrix of S is M_s,m, whose i-th row SA_i is the suffix array of h_i(S). There are many ways to form the suffix tuples, so many different suffix matrices can be produced.
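The definition above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the source: the hash family `make_hash` (a seeded polynomial hash into a small number of buckets) is hypothetical, the suffix array is built by naive sorting rather than by an efficient construction, and each row of the matrix is stored together with its hashed sequence for convenience.

```python
from typing import Callable, List, Sequence, Tuple


def make_hash(seed: int, buckets: int = 8) -> Callable[[str], int]:
    """Hypothetical hash family: seeded polynomial hash into `buckets` values."""
    def h(event: str) -> int:
        value = seed
        for ch in event:
            value = (value * 31 + ord(ch)) % 1_000_003
        return value % buckets
    return h


def suffix_array(seq: Sequence[int]) -> List[int]:
    """Start positions of all suffixes of `seq`, in lexicographic order."""
    # Naive construction by sorting the suffixes; adequate for short sequences.
    return sorted(range(len(seq)), key=lambda i: list(seq[i:]))


def suffix_matrix(
    events: Sequence[str], hashes: Sequence[Callable[[str], int]]
) -> List[Tuple[List[int], List[int]]]:
    """One row per hash function: (h_i(S), suffix array of h_i(S))."""
    rows = []
    for h in hashes:
        hashed = [h(e) for e in events]  # h_i(S) = h_i(e_1) ... h_i(e_n)
        rows.append((hashed, suffix_array(hashed)))
    return rows


# Illustrative data only; the event names are made up.
events = ["open", "read", "write", "read", "close"]
hashes = [make_hash(seed) for seed in (1, 2, 3)]
matrix = suffix_matrix(events, hashes)
```

Each row is a permutation of the positions 0 to n-1, so the matrix has one suffix array per hash function, which is the shape the search step relies on.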

Suffix array search is divided into two steps: first find potentially similar segments from the suffix matrix, then screen them directly by similarity.

Given a set of mutually independent hash functions and a query sequence, generate the suffix matrix. Then decompose by binary search, searching each row of the suffix matrix in turn. If a field occurs a predetermined number of times in the binary search result set, that field is taken as a candidate field.

The candidate-field algorithm uses the following notation. h(i) denotes the i-th hash function in the hash function set H. Q_hi denotes the hash result of the query sequence processed by h(i). SA_i is the i-th row of the suffix matrix M_s,m, and SA_i[j] is the j-th element of the i-th row. CompareAt(Q_hi, SA_i[j]) performs the binary-search comparison of its two parameters: it returns 1 if the first parameter is larger, -1 if the second parameter is larger, and 0 otherwise. Extract(Q_hi, SA_i, pos) is the function that extracts candidate fields. After the algorithm of Fig. 1 has run, the results that occur r times are extracted, r being the predetermined number of times mentioned above.
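The notation above can be sketched in Python as follows. This is an illustrative reading, not the patent's own listing: `compare_at` plays the role of CompareAt, `find_matches` stands in for Extract by returning every position whose hashed prefix equals the hashed query, and `candidate_positions` keeps the positions found in at least r rows. The three hash functions in the demo are hypothetical, and the suffix array is built by naive sorting.

```python
from collections import Counter
from typing import Callable, List, Sequence


def suffix_array(seq: Sequence[int]) -> List[int]:
    """Start positions of all suffixes of `seq`, in lexicographic order."""
    return sorted(range(len(seq)), key=lambda i: list(seq[i:]))


def compare_at(q: List[int], hashed: List[int], pos: int) -> int:
    """1 if q is larger than the suffix prefix at pos, -1 if smaller, else 0."""
    prefix = hashed[pos:pos + len(q)]
    if q > prefix:
        return 1
    if q < prefix:
        return -1
    return 0


def find_matches(q: List[int], hashed: List[int], sa: List[int]) -> List[int]:
    """Binary-search one row for all positions whose prefix equals q."""
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix that is not smaller than q
        mid = (lo + hi) // 2
        if compare_at(q, hashed, sa[mid]) > 0:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix that is larger than q
        mid = (lo + hi) // 2
        if compare_at(q, hashed, sa[mid]) >= 0:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def candidate_positions(
    events: Sequence[str],
    query: Sequence[str],
    hashes: Sequence[Callable[[str], int]],
    r: int,
) -> List[int]:
    """Positions hit in at least r rows of the suffix matrix."""
    counts: Counter = Counter()
    for h in hashes:
        hashed = [h(e) for e in events]
        q = [h(e) for e in query]
        counts.update(find_matches(q, hashed, suffix_array(hashed)))
    return sorted(p for p, c in counts.items() if c >= r)


# Hypothetical hash functions over single-character events.
hashes = [
    lambda e: ord(e[0]) % 5,
    lambda e: ord(e[0]) % 7,
    lambda e: (ord(e[0]) * 3) % 11,
]
events = ["a", "b", "c", "a", "b", "d"]
result = candidate_positions(events, ["a", "b"], hashes, r=3)
print(result)
```

Lowering r admits positions that match under only some of the hash functions, which is what allows approximate rather than exact matches to reach the similarity filtering step.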

On this basis, the LSH algorithm is applied again to screen the candidate fields directly. Figs. 1 and 2 are schematic diagrams of two LSH implementations. Referring to Fig. 1, each candidate segment is regarded as a "time document" formed by concatenation: once an event e in Fig. 1 is regarded as a text event, the 3 events after it are combined with it to form one "time document", such as Li+1, Li+2, etc.
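A minimal sketch of the "time document" idea of Fig. 1 follows, under assumptions that are not in the source: the MD5-based `seeded_hash` is a hypothetical stand-in for the hash family, 8 min-hash seeds are used, and documents are bucketed by their full signature rather than by bands.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def seeded_hash(event: str, seed: int) -> int:
    """Hypothetical seeded hash: first 8 bytes of MD5 over 'seed:event'."""
    digest = hashlib.md5(f"{seed}:{event}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def time_documents(events: Sequence[str], span: int = 3) -> List[Tuple[str, ...]]:
    """Each event combined with the `span` events that follow it."""
    return [tuple(events[i:i + span + 1]) for i in range(len(events) - span)]


def minhash_signature(doc: Sequence[str], seeds: Sequence[int]) -> Tuple[int, ...]:
    """Min-hash signature of the set of events in one document."""
    return tuple(min(seeded_hash(e, s) for e in set(doc)) for s in seeds)


def bucket_documents(
    docs: Sequence[Sequence[str]], seeds: Sequence[int]
) -> Dict[Tuple[int, ...], List[int]]:
    """Group document indices by identical signature."""
    buckets: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
    for index, doc in enumerate(docs):
        buckets[minhash_signature(doc, seeds)].append(index)
    return dict(buckets)


# Illustrative data only.
seeds = list(range(8))
events = ["a", "b", "c", "d", "a", "c", "b", "d"]
docs = time_documents(events)
buckets = bucket_documents(docs, seeds)
```

Because min-hash works on the set of events, documents 0, 1, 3 and 4 in this demo share a bucket even though their internal order differs; the order within a document is not retained by this signature.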

Referring to Fig. 2, to preserve time information, different hash functions are assigned to each field of an independent segment. For example, suppose each independent segment has length 4 and there are 40 hash functions; each time point is then allocated 10 functions, and each hash function is used to index its own field of the segment. Fig. 2 shows a sequence S and several segments Li+1, Li+2, etc., where p1 to p4 are the 4 regions of each segment. Each region contains one time point, and each p_j has 10 hash functions to compute hash values. If the signatures of two segments are identical, the fields at each time point are also similar, and so the order is preserved.
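The position-specific hashing of Fig. 2 can be sketched as below, using the numbers from the example (segments of length 4, 40 hash functions, 10 per position). The MD5-based `seeded_hash` and the seed numbering scheme are assumptions made for illustration, not details given in the source.

```python
import hashlib
from typing import Sequence, Tuple


def seeded_hash(event: str, seed: int) -> int:
    """Hypothetical seeded hash: first 8 bytes of MD5 over 'seed:event'."""
    digest = hashlib.md5(f"{seed}:{event}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sep_signature(segment: Sequence[str], num_hashes: int = 40) -> Tuple[int, ...]:
    """Signature that keeps order: each position gets its own hash functions."""
    per_position = num_hashes // len(segment)  # 40 / 4 = 10 in the example
    signature = []
    for j, event in enumerate(segment):  # regions p1 .. p4
        for k in range(per_position):
            # Seed j * per_position + k is used only for position j.
            signature.append(seeded_hash(event, j * per_position + k))
    return tuple(signature)


# Illustrative segments only.
same_order = sep_signature(("a", "b", "c", "d"))
repeated = sep_signature(("a", "b", "c", "d"))
reordered = sep_signature(("d", "c", "b", "a"))
print(same_order == repeated, same_order == reordered)
```

Unlike a set-based min-hash signature, this signature changes when the same events appear in a different order, since every position is hashed with functions that no other position uses.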

The above embodiments do not limit the present invention in any way; every technical scheme obtained by equivalent replacement or equivalent transformation falls within the protection scope of the present invention.

Claims

1. A method for quickly searching for feature strings in text sequence data, characterized in that it is carried out according to the following steps:

(1) obtaining the text sequence in the information, i.e. a character string;

2. The method for quickly searching for feature strings in text sequence data according to claim 1, characterized in that: the LSH algorithm is applied to the fields to screen the candidate fields, wherein the candidate fields are ordered by time, and once an event e is regarded as a screened text event, the 3 events that follow it are combined with that preceding event to form an event document.

3. The method for quickly searching for feature strings in text sequence data according to claim 2, characterized in that: each field uses a different hash function; the segment is divided into equal regions according to its characters, each region containing one time point; the hash functions are distributed evenly among the time points to compute hash values; and if the computed signatures are found to be identical, the times are similar and the segment is a query result.