WO2015025751A1

WO2015025751A1 - Frequent sequence enumeration device, method, and recording medium

Info

Publication number: WO2015025751A1
Application number: PCT/JP2014/071136
Authority: WO
Inventors: 幸貴楠村; 優輔村岡
Original assignee: 日本電気株式会社
Priority date: 2013-08-23
Filing date: 2014-08-05
Publication date: 2015-02-26

Abstract

In order to efficiently carry out a frequency calculation process, a frequent sequence enumeration device comprises: a suffix array storage unit which, when all suffixes for sequence data are arranged in dictionary order, stores a suffix array whereby it is possible to reference a location on the sequence data from a location on a dictionary; a suffix array frequency calculation unit which, taking a set of locations on the suffix array as input, counts the number of times a character is included in the suffixes referenced by the suffix array, using the fact that the suffixes are sequenced; and a suffix array mapping unit which, taking a set of locations in the suffix array and a specified character as input, returns an appearance location on the suffix array with respect to the inputted character by adding one to the value within the suffix array and referencing an inverted array of the suffix array.

Description

Frequent sequence enumeration apparatus, method and recording medium

The present invention relates to an apparatus for enumerating high-frequency series from a data string such as text data, purchase data, DNA (deoxyribonucleic acid) data, and the like.

An apparatus for enumerating frequent sequences is known as a technique for discovering patterns useful to users in a large-scale database. Here, a “frequent sequence enumeration device” is a device that enumerates high-frequency subsequences from a database (sequence database) composed of data (sequence data) having a meaningful order relationship, such as purchase data, DNA data, and text. “Sequence data” is a character string having an order relationship. Here, a “character” is a symbol that may take an arbitrary number of types.
Thus, for example, in the case of purchase data, an item (product) may be a character, and in the case of text data, one word may be a character.
A “subsequence” (partial sequence) is a sequence of characters that appear in the sequence data in the same order. For example, the subsequences of the 4-character sequence data ABCD are:
AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, ABCD.
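The listing above can be reproduced with a short sketch. The function name `subsequences` and the `min_len` parameter are illustrative assumptions; `min_len=2` is chosen only because the listing above starts from two-character subsequences.

```python
from itertools import combinations

def subsequences(seq, min_len=2):
    """List subsequences of seq, keeping characters in their original order."""
    found = []
    for length in range(min_len, len(seq) + 1):
        # Each increasing index tuple selects one order-preserving subsequence.
        for idx in combinations(range(len(seq)), length):
            found.append("".join(seq[i] for i in idx))
    return found

subs = subsequences("ABCD")
print(", ".join(subs))
```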
As a method for realizing enumeration of frequent sequences, a method using PrefixSpan is known (see Non-Patent Document 1). PrefixSpan is an algorithm that repeats frequency calculation and processing called database mapping. The database is abbreviated as DB.
FIG. 1 is a block diagram showing a configuration of an enumeration apparatus 10 using PrefixSpan. The enumeration apparatus 10 includes a sequence DB management unit 11, a frequency calculation unit 12, a control unit 13, and a DB mapping unit 14.
For example, assume that the sequence DB in the sequence DB management unit 11 contains the four sequences {ACD, ABC, CBA, AAB} and that 2 is specified as the frequency threshold. In this case, the enumeration apparatus 10 first performs frequency calculation using the frequency calculation unit 12. In this process, the frequency of each character in the sequence DB is counted. In this example, A appears 5 times, B appears 3 times, C appears 3 times, and D appears once. Among these, since A, B, and C each have a frequency of 2 or more, the control unit 13 outputs the three sequences A, B, and C. Furthermore, DB mapping using each character with a frequency of 2 or more is performed on the original sequence DB by the control unit 13 and the DB mapping unit 14.
“DB mapping using the character x” refers to a process of extracting a record including the character x from the series DB and extracting a character string (suffix) after the character x.
For example, if DB mapping using the letter A is performed on the sequence DB {ACD, ABC, CBA, AAB}, a sequence DB consisting of three sequences {CD, BC, AB} is newly obtained.
The enumeration apparatus 10 then performs frequency calculation on this sequence DB and obtains the result that A appears once, B twice, and C twice. As a result, the control unit 13 outputs AB and AC, on the basis that AB appears twice and AC appears twice. This process is performed recursively by depth-first search, so that all sequences whose frequency in the original sequence DB is 2 or more are enumerated.
PrefixSpan thus enumerates frequent sequences by repeating frequency calculation and the processing called DB mapping.
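The worked example above can be traced with a minimal sketch. It is not the implementation of Non-Patent Document 1: the names `db_mapping` and `prefixspan` are hypothetical, frequencies are counted as character occurrences as in the text, and cutting each record after the first occurrence of the character is an assumption that matches the example {CD, BC, AB}.

```python
from collections import Counter

def db_mapping(db, x):
    """For each record containing x, keep the non-empty suffix after its first x."""
    projected = []
    for record in db:
        pos = record.find(x)
        if pos != -1 and pos + 1 < len(record):
            projected.append(record[pos + 1:])
    return projected

def prefixspan(db, threshold, prefix=""):
    """Depth-first enumeration: count characters, output frequent ones, then map."""
    result = {}
    counts = Counter(ch for record in db for ch in record)
    for ch in sorted(counts):
        if counts[ch] >= threshold:
            seq = prefix + ch
            result[seq] = counts[ch]
            # Recurse on the mapped (projected) database for the extended sequence.
            result.update(prefixspan(db_mapping(db, ch), threshold, seq))
    return result

frequent = prefixspan(["ACD", "ABC", "CBA", "AAB"], 2)
print(frequent)
```

Each recursive call copies the suffixes into a new list, which is the repeated data copying that grows costly as the number of sequences increases.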
Non-Patent Document 2 discloses the suffix array.
Other prior art documents related to the present invention are also known. For example, Patent Document 1 discloses a “sequence pattern extraction apparatus and method” capable of improving the efficiency of processing related to projection. In the sequence pattern extraction apparatus disclosed in Patent Document 1, a projection position, which specifies the item to be extracted at the time of projection, is stored in association with each item constituting the sequence data. The position at which projection is to be performed is thereby acquired efficiently, and the processing related to projection is performed efficiently.

JP 2009-169850 A

However, the first problem of the enumeration apparatus 10 disclosed in Non-Patent Document 1 is that its processing time is long. In the conventional method, the frequency calculation process is repeated after each DB mapping, and these processing costs increase as the number of sequence data increases.
In addition, Non-Patent Document 2 merely discloses the suffix array.
Likewise, Patent Document 1 merely discloses a technique for storing, in association with each item, a projection position for specifying the item to be extracted at the time of projection.
[Object of the invention]
An object of the present invention is to realize a high-speed frequent sequence enumeration apparatus that efficiently performs frequency calculation processing.

The frequent sequence enumeration apparatus according to the present invention comprises: a sorted array storage unit that stores a sorted array which, when all suffixes or all prefixes of the sequence data are arranged in dictionary order, makes it possible to refer to a position on the original sequence data from a position in the dictionary order; a sorted array frequency calculation unit that takes the sorted array and a set of positions on the sorted array as input and counts the number of characters existing on the original sequence data by using the fact that the sorted array is ordered; and a sorted array mapping unit that takes a set of positions in the sorted array and a specific character as input, adds to or subtracts from the value in the sorted array, refers to the appearance positions of the suffixes or prefixes adjacent to the input character, and narrows down the set of target positions.

In the present invention, the frequency calculation process can be performed efficiently.

FIG. 1 is a block diagram showing the structure of a related enumeration apparatus disclosed in Non-Patent Document 1.
FIG. 2 is a block diagram showing the configuration of the frequent sequence listing apparatus according to the first embodiment of the present invention.
FIG. 3 is a diagram for explaining the suffix array.
FIG. 4 is a diagram showing an example of a suffix array.
FIG. 5 is a diagram showing a processing flow of the frequent sequence listing apparatus shown in FIG. 2.
FIG. 6 is a diagram showing a processing flow of the frequency calculation processing f2 in FIG. 5.
FIG. 7 is a diagram showing a processing flow of the binary search processing c03 in FIG. 6.
FIG. 8 is a diagram showing a flow of the SA mapping processing f6 in FIG. 5.
FIG. 9 is a diagram showing a processing flow of suffix extension in FIG. 8.
FIG. 10 is a block diagram showing a configuration of a frequent sequence listing apparatus according to the second embodiment of the present invention.
FIG. 11 is a diagram showing another example of the suffix array.
FIG. 12 is a diagram illustrating an example of the output of the frequency calculation unit.
FIG. 13 is a diagram schematically illustrating a target suffix list obtained by SA mapping of the target sequence A.
FIG. 14 is a diagram illustrating another example of the output of the frequency calculation unit.
FIG. 15 is a diagram schematically showing a target suffix list obtained by SA mapping of the target sequence AA.
FIG. 16 is a diagram schematically illustrating a target suffix list obtained by SA mapping of the target sequence AB.

In order to facilitate understanding of the present invention, the suffix array will be described. The “suffix array” is the data structure shown in Non-Patent Document 2 above.
The suffix array is an array that holds the positions when all suffixes appearing in a character string are arranged in dictionary order.
FIG. 3 shows an example of the suffix array of the character string abraca. The leftmost list in FIG. 3 shows all suffixes of the character string abraca. In this figure, for convenience, the character $, which comes first in dictionary order, is appended to the end of the string. The symbol S_i is assigned to each suffix; S_i denotes the suffix starting at the i-th character. The list in the center of FIG. 3 is obtained by sorting the list on the left into dictionary order. The list of integer values on the right side of FIG. 3 is obtained by extracting the subscripts i of the suffixes in the central list, and this is the suffix array.
The suffix array indicates where on the original character string the suffix at an arbitrary position in dictionary order is located. The suffix array is a data structure used mainly for document search, and it is known that an arbitrary substring can be located in time on the order of log(character string length). Hereinafter, the suffix array may be abbreviated simply as SA.
In addition, by also holding the inverse array (reverse array) of the suffix array, the suffix located one character to the right of a given suffix can be extracted. Note that the inverse array V of an array A is the array that satisfies A[V[i]] = i.
Let the suffix array be SA[i] and its inverse array be INV[i].
SA[i] represents the position in the original character string at which the i-th suffix in dictionary order begins.
INV[i] represents the rank in dictionary order of the suffix that begins at the i-th position in the original character string.
Using these two arrays, the position on the suffix array of the suffix one character to the right of the i-th suffix can be calculated as
INV[SA[i] + 1]
For example, in the SA array of FIG. 3, SA[2] (assuming that array subscripts start from 0) is 1, corresponding to abraca$, which is third from the top in dictionary order. Adding 1 to this value gives 2, and looking up in the inverse array the position at which S_2 appears gives 4, which is the position of braca$, the suffix one character to the right.
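The relation INV[SA[i] + 1] can be checked with a small sketch. The helper name `build_suffix_array` is hypothetical, and the construction by plain sorting is a naive stand-in for illustration, not the method of Non-Patent Document 2. String positions in this sketch are 0-based, so SA[2] is 0 here rather than the 1 of the S_i numbering in the example; the resulting dictionary-order position 4 is the same.

```python
def build_suffix_array(text):
    """Return (SA, INV) for text; naive construction by sorting all suffixes."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    inv = [0] * len(text)
    for rank, pos in enumerate(sa):
        inv[pos] = rank          # inverse array: sa[inv[pos]] == pos
    return sa, inv

text = "abraca$"
sa, inv = build_suffix_array(text)

i = 2                             # third suffix in dictionary order: abraca$
right = inv[sa[i] + 1]            # rank of the suffix one character to the right
print(text[sa[i]:], "->", text[sa[right]:], "at rank", right)
```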
In addition, the suffix array has the property that the suffixes located one character to the right of suffixes having the same prefix appear on the suffix array in the same relative order. For example, FIG. 4 shows the suffix array for the character string abracadabra.
For example, in this suffix array, the suffixes starting with the same prefix a are the 1st to 5th suffixes, which begin at the 10th, 7th, 0th, 3rd, and 5th positions of the original character string. The suffixes one character to the right of these begin at the 11th, 8th, 1st, 4th, and 6th positions of the original character string. These appear at the 0th, 6th, 7th, 8th, and 9th positions in dictionary order, respectively, which shows that the five suffixes are arranged in the same relative order in the dictionary order.
A suffix array is a lexicographic sort of all suffixes, and suffixes with the same prefix are ordered by comparing the characters that follow that prefix. In this example, the first character a is the same, so the comparison is decided by the second and subsequent characters, and this order is therefore always maintained.
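The order-preserving property can be confirmed numerically. The sketch below assumes the character string is abracadabra with a trailing $, which is the string consistent with the positions listed above; the variable names are illustrative only.

```python
text = "abracadabra$"
sa = sorted(range(len(text)), key=lambda i: text[i:])   # naive suffix array
inv = [0] * len(text)
for rank, pos in enumerate(sa):
    inv[pos] = rank

# Ranks and start positions of the suffixes whose first character is 'a'.
ranks_a = [r for r in range(len(sa)) if text[sa[r]] == "a"]
starts_a = [sa[r] for r in ranks_a]

# Ranks of the suffixes one character to the right of each of them.
right_ranks = [inv[p + 1] for p in starts_a]

print(ranks_a, starts_a, right_ranks)
```

Because `right_ranks` is already increasing, a list of positions produced by moving one character to the right needs no re-sorting as long as the starting suffixes share their first character.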
Hereinafter, embodiments of the present invention will be described in detail.
[First Embodiment]
Referring to FIG. 2, the frequent sequence listing apparatus 20 according to the first embodiment of the present invention includes an SA creation unit 21, an SA storage unit 22, an SA frequency calculation unit 23, an SA mapping unit 24, and a control unit 25.
The SA creation unit 21 receives a set of sequence data as input and, while inserting delimiters between the sequence data, generates a suffix array, the inverse array of the suffix array, and the original sequence data.
The SA storage unit 22 stores the suffix array created by the SA creation unit 21, the reverse arrangement of the suffix array, and the original series data.
The SA frequency calculation unit 23 refers to the data in the SA storage unit 22, calculates the frequency using the fact that the data is arranged in the dictionary order, and inputs the result to the control unit 25.
The SA mapping unit 24 calculates a suffix pointer to be referred to next based on the pointer specified by the control unit 25.
The SA storage unit 22 stores a suffix array that makes it possible to refer to the position on the series data from the position on the dictionary when all the suffixes about the series data are arranged in dictionary order.
The SA frequency calculation unit 23 receives a set of positions on the suffix array as input, and counts the number of characters included in the suffix referenced by the suffix array by using the fact that the suffixes are ordered.
The SA mapping unit 24 takes a set of positions in the suffix array and a specific character as input, adds 1 to the value in the suffix array, and refers to the inverse array of the suffix array, thereby returning the appearance position on the suffix array for the input character.
Also, the SA frequency calculation unit 23 takes the suffix array and a set of positions on the suffix array as inputs and counts the number of characters existing on the original sequence data by using the fact that the suffix array is ordered; when counting, it switches whether or not to perform a binary search depending on the frequency.
Next, a processing flow of the frequent sequence listing apparatus 20 according to the first embodiment will be described with reference to FIG. 5.
This processing flow starts when a set of series data is input to the SA creation unit 21.
First, the SA creation unit 21 constructs a suffix array based on a set of series data (step f1).
This process is performed in the following procedure.
First, the SA creation unit 21 creates one character string by concatenating the sequence data while inserting a delimiter.
For example, for the four sequence data
{CAABC, ABCB, CABC, ABBCA}
the character string created with “%” as the delimiter and “$” as the terminator is
“CAABC%ABCB%CABC%ABBCA%$”.
Next, the SA creation unit 21 constructs a suffix array SA based on the created character string. Then, the SA creation unit 21 constructs an inverse array INV based on the suffix array SA.
The SA creation unit 21 stores the three items, namely the concatenated character string T, the suffix array SA, and the inverse array INV, in the SA storage unit 22.
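Step f1 can be sketched as follows. The function name `create_sa` is hypothetical, and the suffix array is built by naive sorting purely for illustration; the sketch also relies on “$” and “%” sorting before the letters, which holds for their ASCII codes.

```python
def create_sa(records, delimiter="%", terminator="$"):
    """Build the concatenated string T, the suffix array SA and its inverse INV."""
    text = "".join(r + delimiter for r in records) + terminator
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    inv = [0] * len(text)
    for rank, pos in enumerate(sa):
        inv[pos] = rank
    return text, sa, inv

T, SA, INV = create_sa(["CAABC", "ABCB", "CABC", "ABBCA"])
print(T, len(T))
```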
When three pieces of information are stored in the SA storage unit 22, the SA frequency calculation unit 23 performs frequency calculation processing with reference to the SA (step f2).
This process is shown in FIG. 6.
The frequency calculation process takes the suffix array SA, the character string T, the target suffix list P, and the target sequence TS as inputs. Among these, the target suffix list P is a list of subscripts of the suffix array SA. The target sequence TS represents the sequence that is the source of SA mapping at that time.
At the first execution, there is no mapping source and all suffixes are targeted. Therefore, the id list from 0 to the total string length and the empty string "" are input.
On subsequent executions, the values stored in the SA storage unit 22 by the SA mapping unit 24 are input.
When this process is started, the SA frequency calculation unit 23 first initializes a variable s representing a pointer on P to 0 (initialization process c01).
Next, the SA frequency calculation unit 23 extracts the character corresponding to the s-th subscript of P by T[SA[P[s]]] and sets it as c (character extraction processing c02).
Then, the SA frequency calculation unit 23 finds by binary search the position on P at which the first character is no longer c, and sets it as e (binary search processing c03).
The binary search process c03 is performed by the process of FIG. 7.
The binary search process c03 takes the character c and the target suffix list P as input and uses the binary search algorithm to find the last position among the suffixes in P whose first character is c.
In this process, the search range of the binary search initially extends from 0 to the length of P. Therefore, the SA frequency calculation unit 23 sets the pointers l and u representing the range to 0 and the length of P, respectively (step b01).
Next, the SA frequency calculation unit 23 calculates the position m in the middle of the range (step b02), extracts the character at that position by T[SA[P[m]]], and sets it to t (step b03).
Then, the SA frequency calculation unit 23 performs a comparison process comp(c, t) between c and t (step b04).
The comparison process comp(a, b) compares the character a and the character b in dictionary order and returns 0 if a and b match, 1 if a > b in dictionary order, and -1 if a < b in dictionary order.
When the result of comp(c, t) is 0 or more, the position where the first character is no longer c lies in the latter half of the search range, so the SA frequency calculation unit 23 substitutes m for the start position l to narrow the search to the latter half (step b05).
Otherwise, the SA frequency calculation unit 23 substitutes m for the end position u to narrow the search to the first half (step b06).
Next, the SA frequency calculation unit 23 performs an end determination (step b07). In the end determination, the SA frequency calculation unit 23 checks whether l + 1 is equal to u in order to determine whether the search range has been narrowed down to one line. If so, the SA frequency calculation unit 23 outputs l as the last position where the character c appears, so that the position e at which the first character is no longer c is l + 1. Otherwise, the SA frequency calculation unit 23 returns to step b02 and performs the same processing.
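Steps b01 to b07 can be sketched as below. The function names are hypothetical. The sketch assumes that comp(a, b) is positive when a follows b in dictionary order, which is the reading under which steps b05 and b06 narrow the range correctly, and that P is sorted in increasing order of suffix array subscripts. The end determination is written as the loop condition, which gives the same result for a non-empty P.

```python
def comp(a, b):
    """Return 1 if a sorts after b, 0 if they match, -1 if a sorts before b."""
    return (a > b) - (a < b)

def last_position(c, P, T, SA):
    """Last index in P whose suffix starts with a character not greater than c."""
    l, u = 0, len(P)                 # step b01
    while l + 1 < u:                 # end determination, step b07
        m = (l + u) // 2             # step b02
        t = T[SA[P[m]]]              # step b03
        if comp(c, t) >= 0:          # step b04
            l = m                    # step b05: continue in the latter half
        else:
            u = m                    # step b06: continue in the first half
    return l

T = "CAABC%ABCB%CABC%ABBCA%$"
SA = sorted(range(len(T)), key=lambda i: T[i:])   # naive suffix array
P = list(range(len(T)))                           # all suffixes are targets
last_a = last_position("A", P, T, SA)
print(last_a)
```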
Returning to FIG. 6, as a result, the suffixes beginning with the character c are found to occupy positions s to e on the target suffix list, so the SA frequency calculation unit 23 calculates the frequency of the character c as (e - s) and outputs the character c, the frequency (e - s), the start position s, and the end position e (output process c04).
At this time, if the character c is % or $, it is not output.
Further, in order to calculate the frequency of the next character, the SA frequency calculation unit 23 sets s to e (processing c05).
Finally, the SA frequency calculation unit 23 determines whether or not the pointer s has reached the end of P (determination process c06).
That is, if s = P.length, the SA frequency calculation unit 23 determines that the end has been reached and ends the process.
Otherwise, the SA frequency calculation unit 23 returns to the process c02 and performs the same process.
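The whole frequency calculation (processes c01 to c06) can be sketched as follows. The function name `frequency_calculation` is hypothetical, the end position e is taken to be one past the last position found by the binary search, and P is assumed to be sorted in increasing order.

```python
def frequency_calculation(T, SA, P):
    """Return (character, frequency, s, e) for each run of first characters in P."""
    results = []
    s = 0                                    # c01
    while s < len(P):                        # c06: stop when s reaches P.length
        c = T[SA[P[s]]]                      # c02
        l, u = 0, len(P)                     # c03: binary search for the last c
        while l + 1 < u:
            m = (l + u) // 2
            if T[SA[P[m]]] <= c:
                l = m
            else:
                u = m
        e = l + 1                            # first position whose character is not c
        if c not in "%$":                    # delimiters are not output
            results.append((c, e - s, s, e)) # c04
        s = e                                # c05
    return results

T = "CAABC%ABCB%CABC%ABBCA%$"
SA = sorted(range(len(T)), key=lambda i: T[i:])   # naive suffix array
freq = frequency_calculation(T, SA, list(range(len(T))))
print(freq)
```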
Note that the frequency calculation process would normally require processing time of order O(P.length) in the length of the target suffix list P, but since this algorithm uses binary search, it can be carried out in O(W · log(P.length)) processing time, where W is the number of character types.
However, when the size of the target suffix list P is smaller than the number of character types W, it may be faster to examine the frequency of each character in order from the top.
For this reason, the SA frequency calculation unit 23 may set a threshold for P.length in advance and, if the length is below that threshold, simply count the frequencies sequentially.
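The switch between sequential counting and binary search might look like the sketch below. The constant `LINEAR_THRESHOLD` and all function names are hypothetical, and the value 16 is an arbitrary placeholder rather than a value given in this document.

```python
from collections import Counter

LINEAR_THRESHOLD = 16  # hypothetical cut-off for P.length

def count_linear(T, SA, P):
    """Examine every entry of P in order: O(P.length)."""
    counts = Counter(T[SA[p]] for p in P)
    return {c: n for c, n in counts.items() if c not in "%$"}

def count_binary(T, SA, P):
    """One binary search per distinct character: O(W * log(P.length))."""
    counts, s = {}, 0
    while s < len(P):
        c = T[SA[P[s]]]
        l, u = s, len(P)
        while l + 1 < u:
            m = (l + u) // 2
            if T[SA[P[m]]] <= c:
                l = m
            else:
                u = m
        e = l + 1
        if c not in "%$":
            counts[c] = e - s
        s = e
    return counts

def count(T, SA, P, threshold=LINEAR_THRESHOLD):
    """Choose the counting method from the length of the target suffix list."""
    if len(P) < threshold:
        return count_linear(T, SA, P)
    return count_binary(T, SA, P)

T = "CAABC%ABCB%CABC%ABBCA%$"
SA = sorted(range(len(T)), key=lambda i: T[i:])   # naive suffix array
P = list(range(len(T)))
print(count(T, SA, P))
```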
Returning to FIG. 5, the SA frequency calculation unit 23 inputs the resulting list of characters, frequencies, start positions, and end positions, together with the target suffix list P and the target sequence TS, to the control unit 25.
Next, the control unit 25 performs an output determination process f3 for determining whether or not a character having a frequency equal to or higher than a preset frequency threshold is included based on the frequency list of each character.
When a character with a frequency equal to or higher than the frequency threshold is included, the control unit 25 outputs the sequence obtained by appending that character to the target sequence TS and pushes that sequence onto the stack (processing f4).
Then, the control unit 25 takes out the top series of the stack (processing f5).
At this time, if the stack is empty, the processing is terminated.
When a sequence is obtained from the stack, the control unit 25 inputs the target suffix list and the sequence to the SA mapping unit 24.
When the sequence S and the target suffix list are input, the SA mapping unit 24 performs SA mapping processing (processing f6).
The SA mapping unit 24 takes the sequence S and the target suffix list as input, refers to the suffix array data in the SA storage unit 22, constructs the list of suffixes that appear after the last character c of the sequence S, and updates the target suffix list P.
This process will be described with reference to FIG. 8.
First, the SA mapping unit 24 creates an inverse array P_INV of the target suffix list P (step S11).
This process is calculated by the following formula for the target suffix list.
P_INV [P [i]] = i
Next, for the suffixes in the target suffix list P that lie between the start position and the end position output for the last character c of the sequence S, that is, the suffixes whose first character is c, the SA mapping unit 24 extends the suffix list to the right (step S12).
This process will be described with reference to FIG. 9.
This process is performed using the start position s and end position e of the character c, the target suffix list P, and its inverse array P_INV.
The SA mapping unit 24 performs the following process for each suffix i (s <= i < e) starting with c.
1. i is substituted into the pointer k (step S21).
2. The position on the entire suffix of the character on the right is calculated by n = INV [SA [P [k]] + 1] (step S22).
3. The character at that position is extracted by T [SA [n]] and it is determined whether or not it is a delimiter character “%” (step S23).
4. If so, the process for this suffix ends (step S24).
5. Otherwise, the position on the target suffix list is calculated by P_INV[n], that is, P_INV[INV[SA[P[k]] + 1]] (step S25), the position is output and substituted into k, and the process returns to step S22.
With the above processing, all the next positions of the character string starting with the character c in the current target suffix list can be extracted.
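Steps S11 and S21 to S25 can be sketched as follows. This is one reading of the procedure, not a definitive implementation: the function name `sa_mapping` is hypothetical, k is assumed to advance to the position found in step S25, the new target suffix list is assumed to hold suffix array subscripts, and merging duplicates with a set is an assumption of the sketch.

```python
def sa_mapping(T, SA, INV, P, s, e, delimiter="%"):
    """Collect the suffixes to the right of every suffix P[s:e] within its record."""
    P_INV = {p: i for i, p in enumerate(P)}   # step S11: P_INV[P[i]] = i
    reached = set()
    for i in range(s, e):
        k = i                                 # step S21
        while True:
            n = INV[SA[P[k]] + 1]             # step S22: suffix one character right
            if T[SA[n]] == delimiter:         # step S23: end of the record reached
                break                         # step S24
            k = P_INV[n]                      # step S25: position on the target list
            reached.add(P[k])
    return sorted(reached)                    # sorted, so equal characters are adjacent

T = "CAABC%ABCB%CABC%ABBCA%$"
SA = sorted(range(len(T)), key=lambda i: T[i:])   # naive suffix array
INV = [0] * len(T)
for rank, pos in enumerate(SA):
    INV[pos] = rank

P = list(range(len(T)))
new_P = sa_mapping(T, SA, INV, P, 5, 11)          # suffixes starting with A: 5 <= i < 11
print(sorted(SA[p] for p in new_P))
```

The printed values are the positions on T that follow an A inside the same record.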
Returning to FIG. 8, the SA mapping unit 24 makes the positions obtained by this processing the new target suffix list and stores it in the SA storage unit 22. Further, the SA mapping unit 24 stores the sequence S as the target sequence (step S13).
Returning to FIG. 5, when the target suffix list in the SA storage unit 22 is updated, the SA frequency calculation unit 23 performs frequency calculation processing based on the new target suffix list P and the target sequence TS (frequency calculation processing f2).
The frequent sequence listing device 20 operates by repeating this process (f2 to f6) until the stack in the control unit 25 becomes empty.
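The loop f2 to f6 can be put together in a compact sketch under simplifying assumptions. All function names are hypothetical; the frequency calculation uses a linear scan instead of the binary search for brevity; each stack entry carries its own target suffix list instead of relying on the SA storage unit; the mapping follows INV[SA[n] + 1] directly and merges duplicate positions with a set; and the first frequency calculation is handled as an initial stack entry.

```python
def build(records):
    """f1: concatenate with '%' and a final '$', then build SA and INV naively."""
    T = "".join(r + "%" for r in records) + "$"
    SA = sorted(range(len(T)), key=lambda i: T[i:])
    INV = [0] * len(T)
    for rank, pos in enumerate(SA):
        INV[pos] = rank
    return T, SA, INV

def frequencies(T, SA, P):
    """f2: (character, frequency, s, e) per run; linear scan for brevity."""
    runs, s = [], 0
    while s < len(P):
        c = T[SA[P[s]]]
        e = s
        while e < len(P) and T[SA[P[e]]] == c:
            e += 1
        if c not in "%$":
            runs.append((c, e - s, s, e))
        s = e
    return runs

def sa_mapping(T, SA, INV, P, s, e):
    """f6: positions to the right of the suffixes P[s:e] within the same record."""
    reached = set()
    for i in range(s, e):
        n = INV[SA[P[i]] + 1]
        while T[SA[n]] not in "%$":
            reached.add(n)
            n = INV[SA[n] + 1]
    return sorted(reached)

def enumerate_frequent(records, threshold):
    T, SA, INV = build(records)
    found = {}
    stack = [("", list(range(len(T))), None)]
    while stack:                                           # until the stack is empty
        ts, P, span = stack.pop()                          # f5
        if span is not None:
            P = sa_mapping(T, SA, INV, P, *span)           # f6
        runs = [r for r in frequencies(T, SA, P) if r[1] >= threshold]  # f2, f3
        for c, freq, s, e in runs:
            found[ts + c] = freq                           # f4: output
        for c, freq, s, e in reversed(runs):
            stack.append((ts + c, P, (s, e)))              # f4: push onto the stack
    return found

found = enumerate_frequent(["CAABC", "ABCB", "CABC", "ABBCA"], 2)
print(found)
```

No suffix is ever copied in this sketch: every target suffix list is just a list of subscripts into the single suffix array.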
Each unit of the frequent sequence listing device 20 may be realized by a combination of hardware and software. In that case, an enumeration program is loaded into RAM (random access memory), and hardware such as a control unit (CPU) operates based on the program, thereby realizing each unit as various means. The program may also be recorded on a recording medium and distributed. The program recorded on the recording medium is read into memory via a wired connection, a wireless connection, or the recording medium itself, and causes the control unit and the like to operate. Examples of the recording medium include an optical disk, a magnetic disk, a semiconductor memory device, and a hard disk.
To describe the above-described embodiment in other words, it can be realized by causing an information processing device (computer) that operates as the frequent sequence enumeration device 20 to operate, based on an enumeration program loaded into RAM, as the SA creation unit 21, the SA storage unit 22, the SA frequency calculation unit 23, the SA mapping unit 24, and the control unit 25.
Next, the effect of the frequent sequence listing device 20 of the first embodiment will be described.
In the frequent sequence enumeration apparatus 20 of the first embodiment, DB mapping and frequency calculation processing are performed using a suffix array and its inverse array. If a suffix array is used, a list of suffixes existing to the right of a certain character can be efficiently extracted, so that it is not necessary to copy repeated data. Since suffixes are always arranged in dictionary order, the frequency can be calculated efficiently using this property.
[Second Embodiment]
The enumeration apparatus 20 according to the first embodiment uses a suffix array and its inverse array and moves one character to the right by referring to the two arrays as INV[SA[i] + 1]; however, a prefix array can be used in a similar way. Hereinafter, the prefix array may be abbreviated simply as PA.
The “prefix array” is obtained by arranging all the prefixes of a given character string in dictionary order. In this case, once the prefix array PA and its inverse array INV have been created, the position in the dictionary order of the character one to the right can be found by referring to INV[PA[i] - 1].
Referring to FIG. 10, a frequent sequence listing device 20A according to the second embodiment of the present invention includes a PA creation unit 21A, a PA storage unit 22A, a PA frequency calculation unit 23A, a PA mapping unit 24A, and a control unit 25A.
The PA creation unit 21A receives a set of sequence data as input and, while inserting delimiters between the sequence data, generates a prefix array, the inverse array of the prefix array, and the original sequence data.
The PA storage unit 22A stores the prefix array created by the PA creation unit 21A, the reverse array of the prefix array, and the original series data.
The PA frequency calculation unit 23A refers to the data in the PA storage unit 22A, performs frequency calculation using the fact that the data is arranged in dictionary order, and inputs the result to the control unit 25A.
The PA mapping unit 24A calculates a prefix pointer to be referred to next based on the pointer designated by the control unit 25A.
The PA storage unit 22A stores a prefix array that makes it possible to refer to the position on the series data from the position on the dictionary when all the prefixes about the series data are arranged in the dictionary order.
The PA frequency calculation unit 23A receives a set of positions on the prefix array as an input, and counts the number of characters included in the prefix referenced by the prefix array using the fact that the prefixes are ordered.
The PA mapping unit 24A takes a set of positions in the prefix array and a specific character as input, subtracts 1 from the value in the prefix array, and refers to the inverse array of the prefix array, thereby returning the appearance position on the prefix array for the input character.
Further, the PA frequency calculation unit 23A takes the prefix array and a set of positions on the prefix array as inputs and counts the number of characters existing on the original sequence data by using the fact that the prefix array is ordered; when counting, it switches whether or not to perform a binary search depending on the frequency.
The operations of the PA creation unit 21A, PA storage unit 22A, PA frequency calculation unit 23A, PA mapping unit 24A, and control unit 25A are the same as those of the SA creation unit 21, SA storage unit 22, SA frequency calculation unit 23, SA mapping unit 24, and control unit 25 in the first embodiment, respectively, so a detailed description of their operation is omitted.
Note that each unit of the frequent sequence listing device 20A may be realized by a combination of hardware and software. In that case, an enumeration program is loaded into RAM (random access memory), and hardware such as a control unit (CPU) operates based on the program, thereby realizing each unit as various means. The program may also be recorded on a recording medium and distributed. The program recorded on the recording medium is read into memory via a wired connection, a wireless connection, or the recording medium itself, and causes the control unit and the like to operate. Examples of the recording medium include an optical disk, a magnetic disk, a semiconductor memory device, and a hard disk.
To describe the above-described embodiment in other words, it can be realized by causing an information processing device (computer) that operates as the frequent sequence enumeration device 20A to operate, based on an enumeration program loaded into RAM, as the PA creation unit 21A, the PA storage unit 22A, the PA frequency calculation unit 23A, the PA mapping unit 24A, and the control unit 25A.
Next, the effect of the frequent sequence listing apparatus 20A of the second embodiment will be described.
In the frequent sequence enumeration apparatus 20A of the second embodiment, DB mapping and frequency calculation processing are performed using a prefix array and its reverse array. If a prefix array is used, a list of prefixes existing to the right of a certain character can be efficiently extracted, so there is no need to copy repeated data. Moreover, since the prefixes are always arranged in the dictionary order, the frequency can be calculated efficiently using this property.
The “suffix array” and “prefix array” are collectively referred to as “sorted array”.
Accordingly, although not shown in the figures, a frequent sequence enumeration apparatus that expresses the first embodiment and the second embodiment as a higher-level concept comprises a sorted array creation unit, a sorted array storage unit, a sorted array frequency calculation unit, a sorted array mapping unit, and a control unit.
The sorted array storage unit stores a sorted array that makes it possible to refer to the position on the original sequence data from the position in the dictionary order when all suffixes or all prefixes of the sequence data are arranged in dictionary order.
The sorted array frequency calculation unit receives the sorted array and a set of positions on the sorted array as input, and counts the number of characters existing on the original sequence data by using the fact that the sorted array is ordered.
The sorted array mapping unit takes a set of positions in the sorted array and a specific character as input, adds to or subtracts from the value in the sorted array, refers to the appearance positions of the suffixes or prefixes adjacent to the input character, and narrows down the set of target positions.
In addition, the sorted array frequency calculation unit takes the sorted array and a set of positions on the sorted array as inputs and counts the number of characters existing on the original sequence data by using the fact that the sorted array is ordered; when counting, it switches whether or not to perform a binary search depending on the frequency.

Next, an example of the frequent sequence listing apparatus 20 according to the first embodiment shown in FIG. 2 will be described.
Here, it is assumed that the frequency threshold is 2 and that the following four sequences are given as the sequence database.
CAABC
ABCB
CABC
ABBCA
At this time, the SA creation unit 21 inserts the delimiter character % and the terminating character $ to form the character string “CAABC%ABCB%CABC%ABBCA%$”, and creates the suffix array of FIG. 11 (SA creation processing f1 of FIG. 5).
In FIG. 11, each suffix is shown for ease of understanding, but the suffixes themselves are not actually stored.
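The SA creation processing can be sketched as follows. This is only an illustrative sketch, not the implementation of the SA creation unit 21: the function name `build_sorted_array`, the naive sort, the 1-based positions, and the symbol order % < $ < A < B < C are all assumptions, chosen so that the resulting values agree with the SA and INV values used in the worked example of this description. The input is the concatenated character string given above with the spaces removed.

```python
# Assumed symbol order '%' < '$' < 'A' < 'B' < 'C' (an assumption of this sketch).
ORDER = {ch: rank for rank, ch in enumerate("%$ABC")}

def build_sorted_array(text):
    """Return (SA, INV) as 1-based lists; index 0 is an unused placeholder."""
    n = len(text)
    # Naive construction: sort suffix start positions by the suffix they denote.
    starts = sorted(range(1, n + 1),
                    key=lambda s: [ORDER[ch] for ch in text[s - 1:]])
    sa = [0] + starts
    inv = [0] * (n + 1)
    for rank in range(1, n + 1):
        inv[sa[rank]] = rank      # INV[text position] = position on SA
    return sa, inv

# Delimiter % after each sequence, terminating character $ at the end.
T = "CAABC%ABCB%CABC%ABBCA%$"
SA, INV = build_sorted_array(T)
print(SA[6], SA[7], INV[22], INV[3])   # 21 2 1 10
```

Only the integer arrays SA and INV and the text T are kept; the suffixes are materialized here solely as sort keys. A practical implementation would likely use a faster suffix array construction algorithm.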
When the suffix array SA of FIG. 11 is obtained, the SA frequency calculation unit 23 performs a binary search process and calculates the frequency of each character (frequency calculation process f2 of FIG. 5).
In this case, the target suffix list P is 0 to 23, and the target series TS is an empty character string “”.
The result is shown in FIG. 12, which lists, for each of the characters A, B, and C, the frequency and the start and end positions over the entire SA.
The SA frequency calculation unit 23 outputs 0 to 23 as the target suffix list P, the empty character string "" as the target series TS, and the data of FIG. 12.
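The frequency calculation by binary search can be sketched as follows. Because the positions in P are in dictionary order, the first characters of the referenced suffixes are non-decreasing, so the range occupied by each character can be located by two binary searches. The helper names `lower_bound` and `char_frequencies`, the 1-based positions, the symbol order % < $ < A < B < C, and the interpretation of "frequency" as the number of target suffixes beginning with the character are assumptions of this sketch.

```python
ORDER = {ch: r for r, ch in enumerate("%$ABC")}   # assumed symbol order
T = "CAABC%ABCB%CABC%ABBCA%$"
# 1-based suffix array built by naive sorting (index 0 unused).
SA = [0] + sorted(range(1, len(T) + 1),
                  key=lambda s: [ORDER[c] for c in T[s - 1:]])

def first_char_rank(P, i):
    """Rank of the first symbol of the suffix referenced by P[i]."""
    return ORDER[T[SA[P[i]] - 1]]

def lower_bound(P, rank):
    """Smallest index i of P whose suffix starts with a symbol of rank >= rank."""
    lo, hi = 0, len(P)
    while lo < hi:
        mid = (lo + hi) // 2
        if first_char_rank(P, mid) < rank:
            lo = mid + 1
        else:
            hi = mid
    return lo

def char_frequencies(P, alphabet="ABC"):
    """Return {char: (frequency, start position on SA, end position on SA)}."""
    result = {}
    for ch in alphabet:
        start = lower_bound(P, ORDER[ch])
        end = lower_bound(P, ORDER[ch] + 1)
        if end > start:
            result[ch] = (end - start, P[start], P[end - 1])
    return result

P = list(range(1, len(T) + 1))   # all positions on SA (1-based in this sketch)
print(char_frequencies(P))
# {'A': (6, 6, 11), 'B': (6, 12, 17), 'C': (6, 18, 23)}
```

Each character costs two binary searches over P, so the cost grows with the logarithm of the list length rather than with the length itself.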
Next, the control unit 25 examines the data in FIG. 12, outputs the characters A, B, and C (output determination processing f3 in FIG. 5), and pushes “C”, “B”, and “A” onto the stack (step f4 in FIG. 5).
In this state, the control unit 25 further extracts A at the top of the stack (step f5 in FIG. 5), and inputs the target suffix list P (0 to 23) and the sequence (A) to the SA mapping unit 24.
The SA mapping unit 24 performs SA mapping based on these (SA mapping processing f6 in FIG. 5).
In this process, first, the SA mapping unit 24 creates an inverse array for P (step S11 in FIG. 8).
Note that P_INV[i] = P[i], because the current P satisfies P[i] = i.
Then, the SA mapping unit 24 performs a suffix array expansion process (step S12 in FIG. 8). In this process, the SA mapping unit 24 creates a suffix array list starting with A in the current P.
Therefore, the SA mapping unit 24 extends rightward from the 6th to 12th positions in FIG.
For example, the next position n for i = 6 is
n = INV[SA[P[6]] + 1] = INV[SA[6] + 1] = INV[22] = 1.
Since the suffix to the right of i = 6 is at i = 1 and T[SA[1]] = '%', no further processing is performed for it.
Next, an example for i = 7 will be described.
The next position n for i = 7 is
n = INV[SA[P[7]] + 1] = INV[SA[7] + 1] = INV[3] = 10.
Here T[SA[10]] = 'A', so the SA mapping unit 24 outputs
k = P_INV[10] = 10.
By repeating the same process, {6, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 23} is obtained as the target suffix array list for the target sequence A and is output to the SA storage unit 22.
FIG. 13 shows a pseudo partial suffix array in which only these rows are extracted.
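The SA mapping described above can be sketched as follows. For each target suffix beginning with the specified character, the suffix one character to the right is reached through n = INV[SA[i] + 1], and the extension continues rightward until a delimiter is met. The function name `sa_mapping`, the 1-based positions, and the symbol order % < $ < A < B < C are assumptions of this sketch. The sketch also returns positions on SA directly rather than indices into P; the two coincide in the first step, where P[i] = i.

```python
ORDER = {ch: r for r, ch in enumerate("%$ABC")}   # assumed symbol order
T = "CAABC%ABCB%CABC%ABBCA%$"
N = len(T)
# 1-based suffix array and its inverse array (index 0 unused).
SA = [0] + sorted(range(1, N + 1), key=lambda s: [ORDER[c] for c in T[s - 1:]])
INV = [0] * (N + 1)
for rank in range(1, N + 1):
    INV[SA[rank]] = rank

def sa_mapping(P, ch):
    """Narrow P to the SA positions lying to the right of an occurrence of ch."""
    members = set(P)          # membership test, playing the role of P_INV
    hits = set()
    for i in P:
        if T[SA[i] - 1] != ch:
            continue
        n = INV[SA[i] + 1]    # suffix one character to the right
        # Extend rightward until a delimiter or an already visited suffix is reached.
        while T[SA[n] - 1] not in "%$" and n in members and n not in hits:
            hits.add(n)
            n = INV[SA[n] + 1]
    return sorted(hits)

P = list(range(1, N + 1))
P_A = sa_mapping(P, "A")
print(P_A)                    # [6, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 23]
print(sa_mapping(P_A, "A"))   # [15, 19]
```

Stopping at an already visited suffix keeps the total work proportional to the data length, since everything to the right of a visited suffix has already been collected.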
Next, the SA frequency calculation unit 23 performs frequency calculation based on these data (frequency calculation processing f2 in FIG. 5).
The result is shown in FIG. 14. As can be seen from FIG. 14, the characters following the target series A are A (2 times), B (6 times), and C (6 times). The control unit 25 therefore combines each of these with the target series A and outputs AA, AB, and AC.
Further, the control unit 25 pushes these onto the stack {C, B}, so that the stack becomes {C, B, AC, AB, AA} (output determination process f3 in FIG. 5).
Then, the control unit 25 takes out the uppermost series AA and performs the same process again.
That is, the target suffix list of FIG. 15 is obtained by the next SA mapping.
However, since B and C are each included only once, the control unit 25 outputs nothing and proceeds to the next sequence in the stack, AB.
In this case, the result of SA mapping is as shown in FIG. 16, and the outputs ABB (3 times) and ABC (4 times) are obtained.
The enumeration apparatus 20 performs this recursive calculation using a suffix array instead of a DB map. The calculation time of the mapping itself is proportional to the data length, as in the conventional case, but because the frequency calculation after mapping can be performed by binary search, the apparatus operates efficiently.
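The overall stack-driven flow (frequency calculation, output determination, push, pop, mapping) can be sketched as follows. This is a simplified sketch under several assumptions: the names `mapping` and `enumerate_frequent` are hypothetical, positions are 1-based, the symbol order is % < $ < A < B < C, the mapped suffix list is stored on the stack together with the series, and the frequency is taken to be the number of target suffixes beginning with a character, counted here with a plain counter instead of a binary search for brevity. Counts produced under this interpretation may differ from those shown in the figures.

```python
from collections import Counter

ORDER = {ch: r for r, ch in enumerate("%$ABC")}   # assumed symbol order
T = "CAABC%ABCB%CABC%ABBCA%$"
N = len(T)
SA = [0] + sorted(range(1, N + 1), key=lambda s: [ORDER[c] for c in T[s - 1:]])
INV = [0] * (N + 1)
for rank in range(1, N + 1):
    INV[SA[rank]] = rank

def head(sa_pos):
    """First symbol of the suffix at position sa_pos on SA."""
    return T[SA[sa_pos] - 1]

def mapping(P, ch):
    """SA positions to the right of an occurrence of ch in P, up to the delimiter."""
    hits = set()
    for i in P:
        if head(i) != ch:
            continue
        n = INV[SA[i] + 1]
        while head(n) not in "%$" and n not in hits:
            hits.add(n)
            n = INV[SA[n] + 1]
    return sorted(hits)

def enumerate_frequent(threshold):
    """Return {series: frequency} for every series meeting the threshold."""
    found = {}
    stack = [("", list(range(1, N + 1)))]   # (target series, target suffix list)
    while stack:
        series, P = stack.pop()
        counts = Counter(head(p) for p in P if head(p) not in "%$")
        # Push in the order C, B, A so that the series ending in A is popped first.
        for ch in sorted(counts, reverse=True):
            if counts[ch] >= threshold:
                found[series + ch] = counts[ch]
                stack.append((series + ch, mapping(P, ch)))
    return found

result = enumerate_frequent(2)
print(sorted(result))
```

Every mapping moves strictly to the right within a sequence, so the target suffix lists shrink and the loop terminates once no character reaches the threshold.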
While the present invention has been described with reference to the embodiments, the present invention is not limited to the above embodiments. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.
A part or all of the above-described embodiment can be described as in the following supplementary notes, but is not limited thereto.
(Supplementary note 1) A frequent sequence enumeration apparatus comprising:
a sorted array storage unit that stores a sorted array which, when all suffixes or all prefixes of series data are arranged in dictionary order, enables reference to a position on the original series data from a position in the dictionary;
a sorted array frequency calculation unit that takes the sorted array and a set of positions on the sorted array as input, and counts the number of characters existing on the original series data by using the fact that the sorted array is ordered; and
a sorted array mapping unit that takes a set of positions in the sorted array and a specific character as input, and narrows down the set of target positions by adding to or subtracting from the value in the sorted array and referring to the appearance position of a suffix or prefix adjacent to the input character.
(Supplementary note 2) The frequent sequence enumeration apparatus according to supplementary note 1, wherein the sorted array frequency calculation unit, when counting the number of characters existing on the original series data by using the fact that the sorted array is ordered, with the sorted array and a set of positions on the sorted array as input, switches whether to perform a binary search depending on the frequency.
(Supplementary note 3) The frequent sequence enumeration apparatus according to supplementary note 1 or 2, wherein the sorted array is a suffix array, the apparatus further comprising a suffix array creation unit that takes a set of series data as input and generates the suffix array, the reverse array of the suffix array, and the original series data while inserting delimiters between the series data.
(Supplementary note 4) The frequent sequence enumeration apparatus according to supplementary note 3, wherein the sorted array storage unit is a suffix array storage unit that stores the suffix array, the reverse array of the suffix array, and the original series data generated by the suffix array creation unit.
(Supplementary note 5) The frequent sequence enumeration apparatus according to supplementary note 1 or 2, wherein the sorted array is a prefix array, the apparatus further comprising a prefix array creation unit that takes a set of series data as input and generates the prefix array, the reverse array of the prefix array, and the original series data while inserting delimiters between the series data.
(Supplementary note 6) The frequent sequence enumeration apparatus according to supplementary note 5, wherein the sorted array storage unit is a prefix array storage unit that stores the prefix array, the reverse array of the prefix array, and the original series data generated by the prefix array creation unit.
(Supplementary note 7) A method of enumerating high-frequency series from a data string using an enumeration apparatus, the method comprising:
a storage step of storing, in a sorted array storage unit, a sorted array which, when all suffixes or all prefixes of series data are arranged in dictionary order, enables reference to a position on the original series data from a position in the dictionary;
a calculation step of taking the sorted array and a set of positions on the sorted array as input, and counting the number of characters existing on the original series data by using the fact that the sorted array is ordered; and
a mapping step of taking a set of positions in the sorted array and a specific character as input, and narrowing down the set of target positions by adding to or subtracting from the value in the sorted array and referring to the appearance position of a suffix or prefix adjacent to the input character.
(Supplementary note 8) The frequent sequence enumeration method according to supplementary note 7, wherein the calculation step, when counting the number of characters existing on the original series data by using the fact that the sorted array is ordered, with the sorted array and a set of positions on the sorted array as input, switches whether to perform a binary search depending on the frequency.
(Supplementary note 9) The frequent sequence enumeration method according to supplementary note 7 or 8, wherein the sorted array is a suffix array, the method further comprising a creation step of taking a set of series data as input and generating the suffix array, the reverse array of the suffix array, and the original series data while inserting delimiters between the series data.
(Supplementary note 10) The frequent sequence enumeration method according to supplementary note 9, wherein the storage step stores the suffix array, the reverse array of the suffix array, and the original series data generated in the creation step in a suffix array storage unit serving as the sorted array storage unit.
(Supplementary note 11) The frequent sequence enumeration method according to supplementary note 7 or 8, wherein the sorted array is a prefix array, the method further comprising a creation step of taking a set of series data as input and generating the prefix array, the reverse array of the prefix array, and the original series data while inserting delimiters between the series data.
(Supplementary note 12) The frequent sequence enumeration method according to supplementary note 11, wherein the storage step stores the prefix array, the reverse array of the prefix array, and the original series data generated in the creation step in a prefix array storage unit serving as the sorted array storage unit.
(Supplementary note 13) A computer-readable recording medium recording a program for causing a computer to enumerate high-frequency series from a data string, the program causing the computer to execute:
a storage procedure of storing, in a sorted array storage unit, a sorted array which, when all suffixes or all prefixes of series data are arranged in dictionary order, enables reference to a position on the original series data from a position in the dictionary;
a calculation procedure of taking the sorted array and a set of positions on the sorted array as input, and counting the number of characters existing on the original series data by using the fact that the sorted array is ordered; and
a mapping procedure of taking a set of positions in the sorted array and a specific character as input, and narrowing down the set of target positions by adding to or subtracting from the value in the sorted array and referring to the appearance position of a suffix or prefix adjacent to the input character.
(Supplementary note 14) The computer-readable recording medium according to supplementary note 13, wherein the calculation procedure causes the computer, when counting the number of characters existing on the original series data by using the fact that the sorted array is ordered, with the sorted array and a set of positions on the sorted array as input, to switch whether to perform a binary search depending on the frequency.
(Supplementary note 15) The computer-readable recording medium according to supplementary note 13 or 14, wherein the sorted array is a suffix array, and the program further causes the computer to execute a creation procedure of taking a set of series data as input and generating the suffix array, the reverse array of the suffix array, and the original series data while inserting delimiters between the series data.
(Supplementary note 16) The computer-readable recording medium according to supplementary note 15, wherein the storage procedure causes the computer to store the suffix array, the reverse array of the suffix array, and the original series data generated in the creation procedure in a suffix array storage unit serving as the sorted array storage unit.
(Supplementary note 17) The computer-readable recording medium according to supplementary note 13 or 14, wherein the sorted array is a prefix array, and the program further causes the computer to execute a creation procedure of taking a set of series data as input and generating the prefix array, the reverse array of the prefix array, and the original series data while inserting delimiters between the series data.
(Supplementary note 18) The computer-readable recording medium according to supplementary note 17, wherein the storage procedure causes the computer to store the prefix array, the reverse array of the prefix array, and the original series data generated in the creation procedure in a prefix array storage unit serving as the sorted array storage unit.

The present invention can be used to quickly calculate characteristic patterns that frequently appear in analysis of purchase logs, analysis of DNA, analysis of text data, and log data.

20, 20A Frequent sequence listing device
21 SA creation unit
21A PA creation unit
22 SA storage unit (sorted array storage unit)
22A PA storage unit (sorted array storage unit)
23 SA frequency calculation unit (sorted array frequency calculation unit)
23A PA frequency calculation unit (sorted array frequency calculation unit)
24 SA mapping unit (sorted array mapping unit)
24A PA mapping unit (sorted array mapping unit)
25, 25A Control unit

This application claims priority based on Japanese Patent Application No. 2013-173134 filed on August 23, 2013, the entire disclosure of which is incorporated herein.

Claims

A frequent sequence enumeration apparatus comprising:
a sorted array storage unit that stores a sorted array which, when all suffixes or all prefixes of series data are arranged in dictionary order, enables reference to a position on the original series data from a position in the dictionary;
a sorted array frequency calculation unit that takes the sorted array and a set of positions on the sorted array as input, and counts the number of characters existing on the original series data by using the fact that the sorted array is ordered; and
a sorted array mapping unit that takes a set of positions in the sorted array and a specific character as input, and narrows down the set of target positions by adding to or subtracting from the value in the sorted array and referring to the appearance position of a suffix or prefix adjacent to the input character.
The frequent sequence enumeration apparatus according to claim 1, wherein the sorted array frequency calculation unit, when counting the number of characters existing on the original series data by using the fact that the sorted array is ordered, with the sorted array and a set of positions on the sorted array as input, switches whether to perform a binary search depending on the frequency.
The frequent sequence enumeration apparatus according to claim 1 or 2, wherein the sorted array comprises a suffix array, the apparatus further comprising a suffix array creation unit that takes a set of series data as input and generates the suffix array, the reverse array of the suffix array, and the original series data while inserting delimiters between the series data.
The frequent sequence enumeration apparatus according to claim 3, wherein the sorted array storage unit includes a suffix array storage unit that stores the suffix array, the reverse array of the suffix array, and the original series data generated by the suffix array creation unit.
The frequent sequence enumeration apparatus according to claim 1 or 2, wherein the sorted array comprises a prefix array, the apparatus further comprising a prefix array creation unit that takes a set of series data as input and generates the prefix array, the reverse array of the prefix array, and the original series data while inserting delimiters between the series data.
The frequent sequence enumeration apparatus according to claim 5, wherein the sorted array storage unit includes a prefix array storage unit that stores the prefix array, the reverse array of the prefix array, and the original series data generated by the prefix array creation unit.
A method of enumerating high-frequency series from a data string using an enumeration apparatus, the method comprising:
a storage step of storing, in a sorted array storage unit, a sorted array which, when all suffixes or all prefixes of series data are arranged in dictionary order, enables reference to a position on the original series data from a position in the dictionary;
a calculation step of taking the sorted array and a set of positions on the sorted array as input, and counting the number of characters existing on the original series data by using the fact that the sorted array is ordered; and
a mapping step of taking a set of positions in the sorted array and a specific character as input, and narrowing down the set of target positions by adding to or subtracting from the value in the sorted array and referring to the appearance position of a suffix or prefix adjacent to the input character.
The frequent sequence enumeration method according to claim 7, wherein the calculation step, when counting the number of characters existing on the original series data by using the fact that the sorted array is ordered, with the sorted array and a set of positions on the sorted array as input, switches whether to perform a binary search depending on the frequency.
A computer-readable recording medium recording a program for causing a computer to enumerate high-frequency series from a data string, the program causing the computer to execute:
a storage procedure of storing, in a sorted array storage unit, a sorted array which, when all suffixes or all prefixes of series data are arranged in dictionary order, enables reference to a position on the original series data from a position in the dictionary;
a calculation procedure of taking the sorted array and a set of positions on the sorted array as input, and counting the number of characters existing on the original series data by using the fact that the sorted array is ordered; and
a mapping procedure of taking a set of positions in the sorted array and a specific character as input, and narrowing down the set of target positions by adding to or subtracting from the value in the sorted array and referring to the appearance position of a suffix or prefix adjacent to the input character.
The computer-readable recording medium according to claim 9, wherein the calculation procedure causes the computer, when counting the number of characters existing on the original series data by using the fact that the sorted array is ordered, with the sorted array and a set of positions on the sorted array as input, to switch whether to perform a binary search depending on the frequency.