WO2015025751A1 - Frequent sequence enumeration device, method, and recording medium - Google Patents

Frequent sequence enumeration device, method, and recording medium

Info

Publication number
WO2015025751A1
Authority
WO
WIPO (PCT)
Prior art keywords
array
suffix
sorted
series data
sorted array
Prior art date
Application number
PCT/JP2014/071136
Other languages
French (fr)
Japanese (ja)
Inventor
幸貴 楠村
優輔 村岡
Original Assignee
日本電気株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社 filed Critical 日本電気株式会社
Publication of WO2015025751A1 publication Critical patent/WO2015025751A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24568Data stream processing; Continuous queries

Definitions

  • the present invention relates to an apparatus for enumerating high-frequency series from a data string such as text data, purchase data, DNA (deoxyribonucleic acid) data, and the like.
  • the “frequent sequence enumeration device” refers to a device that enumerates high-frequency partial series from a database (series database) composed of data (series data) having a meaningful order relationship, such as purchase data, DNA data, and texts.
  • Sequence data is an order-related character string.
  • the “character” is a symbol having an arbitrary number of types. Thus, for example, in the case of purchase data, an item (product) may be a character, and in the case of text data, one word may be a character.
  • “Partial sequence” refers to a series of characters that appear, in order, in the sequence data.
  • a partial series of 4-character series data ABCD is: AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, ABCD.
  • PrefixSpan is an algorithm that repeats frequency calculation and processing called database mapping.
  • the database is abbreviated as DB.
  • FIG. 1 is a block diagram showing a configuration of an enumeration apparatus 10 using PrefixSpan.
  • the enumeration apparatus 10 includes a sequence DB management unit 11, a frequency calculation unit 12, a control unit 13, and a DB mapping unit 14.
  • the enumeration apparatus 10 first performs frequency calculation using the frequency calculation unit 12. In this process, the frequency of each character in the series DB is counted. In this example, A appears 5 times, B appears 3 times, C appears 3 times, and D appears once. Since the characters {A, B, C} each have a frequency of 2 or more, the control unit 13 outputs the three series A, B, and C. Furthermore, DB mapping using each character with a frequency of 2 or more is performed on the original series DB using the control unit 13 and the DB mapping unit 14.
  • DB mapping using the character x refers to a process of extracting the records that include the character x from the series DB and extracting from each the character string (suffix) that follows x. For example, if DB mapping using the character A is performed on the series DB {ACD, ABC, CBA, AAB}, a new series DB consisting of the three series {CD, BC, AB} is obtained. The enumeration apparatus 10 then performs frequency calculation on this series DB and obtains the result that A appears once, B twice, and C twice. As a result, the control unit 13 outputs AB and AC, since AB appears twice and AC appears twice.
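The frequency calculation and DB mapping described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the implementation of Non-Patent Document 1: the function names `count_chars` and `db_map` are hypothetical, frequencies are counted per character occurrence as in the example text, and the mapping keeps the suffix after the first occurrence of the character and drops empty suffixes.

```python
from collections import Counter

def count_chars(db):
    """Count how often each character occurs in the series DB."""
    return Counter(ch for record in db for ch in record)

def db_map(db, x):
    """DB mapping: keep the suffix after the first x in each record (assumed reading)."""
    projected = []
    for record in db:
        i = record.find(x)
        if i != -1 and i + 1 < len(record):  # skip records without x or with empty suffix
            projected.append(record[i + 1:])
    return projected

db = ["ACD", "ABC", "CBA", "AAB"]
print(sorted(count_chars(db).items()))         # [('A', 5), ('B', 3), ('C', 3), ('D', 1)]
mapped = db_map(db, "A")
print(mapped)                                  # ['CD', 'BC', 'AB']
print(sorted(count_chars(mapped).items()))     # [('A', 1), ('B', 2), ('C', 2), ('D', 1)]
```

Each call to `db_map` copies suffixes into a new list, which is the repeated-copy cost that the suffix-array approach described later is meant to avoid.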
  • Patent Document 1 discloses a “sequence pattern extraction apparatus and method” capable of improving the efficiency of processing related to projection.
  • in the sequence pattern extraction apparatus disclosed in Patent Document 1, a projection position that specifies the item to be extracted during projection is stored in association with each item constituting the sequence data. The position at which projection is performed is thereby acquired efficiently, and the processing related to projection is performed efficiently.
  • the first problem of the enumeration apparatus 10 disclosed in Non-Patent Document 1 is that its processing time is long. In the conventional method, the frequency calculation process is repeated after each DB mapping, so the processing cost increases as the number of series data increases. In addition, Non-Patent Document 2 only discloses the suffix array
  • the frequent sequence enumeration apparatus includes a sorted array storage unit that stores a sorted array, which makes it possible to refer to a position on the original series data from a position in dictionary order when all suffixes or all prefixes of the series data are arranged in dictionary order;
  • a sorted array frequency calculation unit that takes the sorted array and a set of positions on the sorted array as input and counts the number of characters existing on the original series data by using the fact that the sorted array is ordered; and
  • a sorted array mapping unit that takes a set of positions in the sorted array and a specific character as input, adds to or subtracts from the value in the sorted array, refers to the appearance positions of the suffixes or prefixes adjacent to the input character, and narrows down the set of target positions.
  • the frequency calculation process can be performed efficiently.
  • FIG. 1 is a block diagram showing the structure of a related enumeration apparatus disclosed in Non-Patent Document 1.
  • FIG. 2 is a block diagram showing the configuration of the frequent sequence listing apparatus according to the first embodiment of the present invention.
  • FIG. 3 is a diagram for explaining the suffix array.
  • FIG. 4 is a diagram showing an example of a suffix array.
  • FIG. 5 is a diagram showing a processing flow of the frequent sequence listing apparatus shown in FIG.
  • FIG. 6 is a diagram showing a processing flow of the frequency calculation processing f2 in FIG.
  • FIG. 7 is a diagram showing a processing flow of the binary search processing c03 in FIG.
  • FIG. 8 is a diagram showing a flow of the SA mapping processing f6 in FIG.
  • FIG. 9 is a diagram showing a processing flow of suffix extension in FIG.
  • FIG. 10 is a block diagram showing a configuration of a frequent sequence listing apparatus according to the second embodiment of the present invention.
  • FIG. 11 is a diagram showing another example of the suffix array.
  • FIG. 12 is a diagram illustrating an example of the output of the frequency calculation unit.
  • FIG. 13 is a diagram schematically illustrating a target suffix list obtained by SA mapping of the target sequence A.
  • FIG. 14 is a diagram illustrating another example of the output of the frequency calculation unit.
  • FIG. 15 is a diagram schematically showing a target suffix list obtained by SA mapping of the target sequence AA.
  • FIG. 16 is a diagram schematically illustrating a target suffix list obtained by SA mapping of the target sequence AB.
  • the suffix array is the data structure shown in Non-Patent Document 2 above.
  • the suffix array is an array that holds the positions when all suffixes appearing in a character string are arranged in dictionary order.
  • FIG. 3 shows an example of the suffix array of the character string abraca. The leftmost list in FIG. 3 represents all suffixes for the character string abraca. In this figure, for the sake of convenience, the character $ that comes first in the dictionary order is given at the end.
  • the symbol S i is assigned to each suffix. S i is a symbol that means a suffix after the i-th character. The list in the center of FIG.
  • the suffix array indicates where the suffixes at arbitrary positions in the dictionary order exist on the original character string.
  • the suffix array is a data structure mainly used for document search, and it is known that an arbitrary partial character string can be located in time on the order of log (character string length).
  • the suffix array may be simply abbreviated as SA.
  • the suffix array is denoted SA[i], and its inverse array (also called the reverse array) is denoted INV[i].
  • the suffix array SA[i] represents the character position in the original character string at which the i-th suffix in dictionary order begins.
  • the inverse array INV[i] represents the rank in dictionary order of the suffix that begins at the i-th character of the original character string.
  • the suffix immediately to the right of the i-th suffix on the suffix array can be obtained by calculating INV[SA[i] + 1]. For example, in the SA array of FIG.
  • the suffix array has a property that a suffix set to the right of a suffix having the same prefix appears in the same order on the suffix array.
  • FIG. 4 shows a suffix array for the character string abracadabra.
  • the suffixes starting with the same prefix a are the 1st to 5th suffixes, which begin at the 10th, 7th, 0th, 3rd, and 5th characters of the original character string.
  • the suffix set to the right of this list is the set of suffixes beginning at the 11th, 8th, 1st, 4th, and 6th characters of the original character string. These appear at the 0th, 6th, 7th, 8th, and 9th positions in dictionary order, respectively, so the five suffixes are arranged in the same relative order in dictionary order.
  • a suffix array is a lexicographical sort of all suffixes, and suffixes with the same prefix are ordered by the characters that follow that prefix. In this example, since the first character a is the same, the comparison is decided by the second and subsequent characters, so this order is always maintained.
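The order-preserving property described above can be checked with a short Python sketch. It assumes the character string is abracadabra with the end symbol $ appended (which sorts before the letters), and it builds the suffix array by naive sorting purely for illustration; the variable names are not taken from the figures.

```python
T = "abracadabra$"

# Suffix array: start positions of all suffixes, sorted in dictionary order.
SA = sorted(range(len(T)), key=lambda i: T[i:])

# Inverse array: INV[p] is the dictionary-order rank of the suffix starting at p.
INV = [0] * len(T)
for rank, pos in enumerate(SA):
    INV[pos] = rank

# Ranks of the suffixes whose first character is 'a'.
a_ranks = [r for r in range(len(T)) if T[SA[r]] == "a"]
starts = [SA[r] for r in a_ranks]            # where they begin in T
right = [INV[SA[r] + 1] for r in a_ranks]    # ranks of the suffixes one character to the right

print(a_ranks)  # [1, 2, 3, 4, 5]
print(starts)   # [10, 7, 0, 3, 5]
print(right)    # [0, 6, 7, 8, 9]
```

The list `right` is already in increasing order, which is the property the frequency calculation relies on: moving every suffix in a same-prefix block one character to the right keeps the block sorted.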
  • the frequent sequence listing apparatus 20 includes an SA creation unit 21, an SA storage unit 22, an SA frequency calculation unit 23, an SA mapping unit 24, and a control unit 25.
  • the SA creation unit 21 receives a series data set as input, and generates a suffix array, a reverse array of the suffix array, and original series data while inserting a delimiter of the series data.
  • the SA storage unit 22 stores the suffix array created by the SA creation unit 21, the reverse arrangement of the suffix array, and the original series data.
  • the SA frequency calculation unit 23 refers to the data in the SA storage unit 22, calculates the frequency using the fact that the data is arranged in the dictionary order, and inputs the result to the control unit 25.
  • the SA mapping unit 24 calculates a suffix pointer to be referred to next based on the pointer specified by the control unit 25.
  • the SA storage unit 22 stores a suffix array that makes it possible to refer to the position on the series data from the position on the dictionary when all the suffixes about the series data are arranged in dictionary order.
  • the SA frequency calculation unit 23 receives a set of positions on the suffix array as input, and counts the number of characters included in the suffix referenced by the suffix array by using the fact that the suffixes are ordered.
  • the SA mapping unit 24 takes a set of positions in the suffix array and a specific character as input, adds 1 to the value in the suffix array, refers to the reverse array of the suffix array, and thereby returns the positions on the suffix array for the input character.
  • the SA frequency calculation unit 23 takes the suffix array and a set of positions on the suffix array as inputs and counts the number of characters existing on the original series data by using the fact that the suffix array is ordered. When counting, whether or not to perform a binary search is switched depending on the frequency.
  • the SA creation unit 21 constructs a suffix array based on a set of series data (step f1). This process is performed in the following procedure.
  • the SA creation unit 21 creates one character string by connecting the series data while inserting a delimiter. For example, for the series data set {CAABC, ABCB, CABC, ABBCA}, the character string created with “%” as the delimiter and “$” as the end marker is “CAABC%ABCB%CABC%ABBCA%$”.
  • the SA creation unit 21 constructs a suffix array SA based on the created character string. Then, the SA creation unit 21 constructs a reverse array INV based on the suffix array SA.
  • the SA creation unit 21 stores the combined character string T, the suffix array SA, and the reverse array INV in the SA storage unit 22.
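Step f1 can be sketched as follows. This is an illustrative assumption rather than the SA creation unit itself: the function name `build_index` is hypothetical, and the suffix array is built by naive sorting of suffixes, whereas a practical implementation would likely use a dedicated construction algorithm.

```python
DELIM, END = "%", "$"  # "$" sorts before "%", which sorts before the letters

def build_index(records):
    """Return the combined string T, its suffix array SA, and the inverse array INV."""
    T = DELIM.join(records) + DELIM + END
    SA = sorted(range(len(T)), key=lambda i: T[i:])
    INV = [0] * len(T)
    for rank, pos in enumerate(SA):
        INV[pos] = rank
    return T, SA, INV

T, SA, INV = build_index(["CAABC", "ABCB", "CABC", "ABBCA"])
print(T)       # CAABC%ABCB%CABC%ABBCA%$
print(SA[0])   # 22, the position of the end marker "$"
```

Because INV is the inverse permutation of SA, `INV[SA[r]] == r` holds for every rank r.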
  • the SA frequency calculation unit 23 performs frequency calculation processing with reference to the SA (step f2). This process is shown in FIG.
  • the frequency calculation process is processed with the suffix array SA, the character string T, the target suffix list P, and the target series TS as inputs.
  • the target suffix list P is a list representing the subscripts of the suffix array SA.
  • the target sequence TS represents the sequence that is the source of SA mapping at that time. At the first execution, there is no mapping source, and all suffixes are targeted.
  • the SA mapping unit 24 sets values stored in the SA storage unit 22.
  • the SA frequency calculation unit 23 first initializes a variable s representing a pointer on P as 0 (initialization process c01).
  • the SA frequency calculation unit 23 extracts the character corresponding to the s-th subscript of P by T[SA[P[s]]] and sets it as c (character extraction processing c02).
  • the SA frequency calculation unit 23 extracts a position where the first character is not c on P by binary search, and sets it as e (binary search processing c03).
  • the binary search process c03 is performed by the process of FIG.
  • the binary search process c03 receives the character c and the target suffix list P as input, and uses the binary search algorithm to find the final position in P of a suffix whose first character is c.
  • the search range of the binary search starts as 0 to the length of P. Therefore, the SA frequency calculation unit 23 sets the pointers l and u representing the range to 0 and the length of P, respectively (step b01).
  • the SA frequency calculation unit 23 calculates the position m in the middle of the range (step b02), extracts the character at that position by T[SA[P[m]]], and sets it to t (step b03).
  • the SA frequency calculation unit 23 performs a comparison process comp(c, t) between c and t (step b04).
  • the comparison process comp(a, b) compares the character a and the character b in dictionary order and returns 0 if they match, 1 if character a < character b, and -1 if character a > character b.
  • to make the latter half the search range, the SA frequency calculation unit 23 substitutes m for the start position l (step b05).
  • to make the first half the search range, the SA frequency calculation unit 23 substitutes m for the end position u (step b06).
  • the SA frequency calculation unit 23 performs an end determination (step b07). In the end determination, the SA frequency calculation unit 23 checks whether l + 1 is equal to u in order to determine whether the search range has been narrowed down to one line. If so, the SA frequency calculation unit 23 outputs l as the last position where the character c appears; otherwise, it returns to step b02 and performs the same processing. Returning to FIG.
  • the SA frequency calculation unit 23 determines that it has reached the end and ends the process. Otherwise, the SA frequency calculation unit 23 returns to the process c02 and performs the same process.
  • the frequency calculation process normally requires time on the order of O(P.length) in the length of the target suffix list P, but since this algorithm uses a binary search, it can be performed in O(W × log(P.length)) processing time, where W is the number of character types.
  • a threshold is set in advance for P.length.
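The frequency calculation with binary search (processes c01 to c03) might look like the sketch below. It is an assumption-laden illustration: `count_first_chars` is a hypothetical name, P is assumed to be a sorted list of subscripts of SA, the search for each character starts from the current position s rather than 0, and the threshold-based switch between binary and linear search is omitted.

```python
def count_first_chars(T, SA, P):
    """Return (char, frequency, start, end) for each run of equal first characters in P."""
    result = []
    s = 0
    while s < len(P):
        c = T[SA[P[s]]]              # character extraction (c02)
        l, u = s, len(P)             # search range [l, u)
        while l + 1 < u:             # binary search for the last suffix starting with c (c03)
            m = (l + u) // 2
            if T[SA[P[m]]] <= c:
                l = m                # keep the latter half
            else:
                u = m                # keep the first half
        result.append((c, l - s + 1, s, l))
        s = l + 1                    # jump to the next distinct character
    return result

T = "CAABC%ABCB%CABC%ABBCA%$"
SA = sorted(range(len(T)), key=lambda i: T[i:])
P = list(range(len(T)))              # first execution: all suffixes are targeted
for row in count_first_chars(T, SA, P):
    print(row)
# ('$', 1, 0, 0)
# ('%', 4, 1, 4)
# ('A', 6, 5, 10)
# ('B', 6, 11, 16)
# ('C', 6, 17, 22)
```

Each distinct character costs one binary search over P, which is where the W × log(P.length) behaviour comes from; the delimiter and end marker show up as ordinary characters here and would be ignored by the caller.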
  • the frequency calculation unit 23 inputs the resulting character, frequency, start position, end position list, target suffix list P, and target sequence TS to the control unit 25.
  • the control unit 25 performs an output determination process f3 for determining whether or not a character having a frequency equal to or higher than a preset frequency threshold is included based on the frequency list of each character.
  • the control unit 25 outputs the character string obtained by appending the character to the target sequence TS, and adds that character string to the stack (processing f4).
  • the control unit 25 takes out the top series of the stack (processing f5).
  • the control unit 25 inputs the target suffix list and the sequence to the SA mapping unit 24.
  • the SA mapping unit 24 performs DB mapping processing (processing f6).
  • the SA mapping unit 24 receives the sequence S and the target suffix list as input, refers to the suffix array data in the SA storage unit 22, constructs the list of suffixes appearing after the last character c in the sequence S, and updates the target suffix list P. This process will be described with reference to FIG. First, the SA mapping unit 24 creates an inverse array P_INV of the target suffix list P (step S11).
  • the SA mapping unit 24 expands, using the suffix array, the suffixes lying between the start position and the end position, in the target suffix list P, of the suffixes whose first character is the last character c of the sequence S (step S12). This process will be described with reference to FIG. This process is performed using the start position s and end position e of the character c, the target suffix list P, and its inverse array P_INV.
  • n = INV[SA[P[k]] + 1] is calculated (step S22).
  • the character at that position is extracted by T[SA[n]], and it is determined whether or not it is the delimiter character “%” (step S23). If so, the process ends (step S24). Otherwise, the position on the target suffix list is calculated by P_INV[SA[P[k]] + 1] (step S25), the position is output, and the process returns to step S22.
  • the SA mapping unit 24 makes the position obtained by this processing a new target suffix list and stores it in the SA storage unit 22. Further, the SA mapping unit 24 stores the sequence S as a target sequence (step S13).
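The suffix extension of SA mapping (steps S22 to S25) can be illustrated with the following sketch. It is a simplified reading, not the SA mapping unit itself: the name `sa_map` is hypothetical, the result is returned as a sorted list of subscripts of SA instead of positions looked up through P_INV, and a set is used so that a suffix reached from several occurrences of the character is kept only once.

```python
from collections import Counter

DELIM, END = "%", "$"

def sa_map(T, SA, INV, P, s, e):
    """Collect the suffixes to the right of P[s..e], stopping at each record delimiter."""
    reached = set()
    for k in range(s, e + 1):
        n = INV[SA[P[k]] + 1]                   # one character to the right (S22)
        while T[SA[n]] not in (DELIM, END):     # stop at the delimiter (S23, S24)
            reached.add(n)                      # output the position (S25)
            n = INV[SA[n] + 1]
    return sorted(reached)                      # sorted, so first characters stay ordered

T = "ACD%ABC%CBA%AAB%$"
SA = sorted(range(len(T)), key=lambda i: T[i:])
INV = [0] * len(T)
for rank, pos in enumerate(SA):
    INV[pos] = rank

P = list(range(len(T)))
a_positions = [k for k in range(len(P)) if T[SA[P[k]]] == "A"]
new_P = sa_map(T, SA, INV, P, a_positions[0], a_positions[-1])
print(sorted(Counter(T[SA[n]] for n in new_P).items()))
# [('A', 1), ('B', 2), ('C', 2), ('D', 1)]
```

On the series DB {ACD, ABC, CBA, AAB} this gives the same counts as DB mapping with the character A (A once, B twice, C twice), without copying any suffix strings.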
  • the SA frequency calculation unit 23 performs frequency calculation processing based on the new target suffix list P and the target series TS (frequency calculation processing f2).
  • the frequent sequence listing device 20 operates by repeating this process (f2 to f6) until the stack in the control unit 25 becomes empty. Each unit of the frequent sequence listing device 20 may be realized by using a combination of hardware and software.
  • the enumeration program is expanded in RAM (random access memory), and each unit is realized as various means by operating hardware such as a control unit (CPU) based on the program. Further, the program may be recorded on a recording medium and distributed. The program recorded on the recording medium is read into a memory via a wired connection, a wireless connection, or the recording medium itself, and operates the control unit and the like. Examples of the recording medium include an optical disk, a magnetic disk, a semiconductor memory device, and a hard disk.
  • an information processing device that operates as the frequent sequence enumeration device 20 can be realized by operating, based on an enumeration program expanded in RAM, as the SA creation unit 21, the SA storage unit 22, the SA frequency calculation unit 23, the SA mapping unit 24, and the control unit 25.
  • the frequent sequence listing device 20 of the first embodiment will be described.
  • DB mapping and frequency calculation processing are performed using a suffix array and its inverse array. If a suffix array is used, a list of suffixes existing to the right of a certain character can be efficiently extracted, so that it is not necessary to copy repeated data.
  • the enumeration apparatus 20 uses a suffix array and its reverse array, and moves one character to the right by referring to the two arrays as INV[SA[i] + 1]; a prefix array can be used in a similar way.
  • the prefix array may be simply abbreviated as PA.
  • the “prefix array” is obtained by rearranging all the prefixes in a given character string in dictionary order.
  • a frequent sequence listing device 20A includes a PA creation unit 21A, a PA storage unit 22A, a PA frequency calculation unit 23A, a PA mapping unit 24A, and a control unit 25A.
  • the PA creation unit 21A receives a series data set as input, and generates a prefix array, a reverse array of the prefix array, and original series data while inserting a series data break.
  • the PA storage unit 22A stores the prefix array created by the PA creation unit 21A, the reverse array of the prefix array, and the original series data.
  • the PA frequency calculation unit 23A refers to the data in the PA storage unit 22A, performs frequency calculation using the fact that the data is arranged in dictionary order, and inputs the result to the control unit 25A.
  • the PA mapping unit 24A calculates a prefix pointer to be referred to next based on the pointer designated by the control unit 25A.
  • the PA storage unit 22A stores a prefix array that makes it possible to refer to the position on the series data from the position on the dictionary when all the prefixes about the series data are arranged in the dictionary order.
  • the PA frequency calculation unit 23A receives a set of positions on the prefix array as an input, and counts the number of characters included in the prefix referenced by the prefix array using the fact that the prefixes are ordered.
  • the PA mapping unit 24A takes a set of positions in the prefix array and a specific character as input, subtracts 1 from the value in the prefix array, refers to the reverse array of the prefix array, and thereby returns the positions on the prefix array for the input character. Further, the PA frequency calculation unit 23A takes the prefix array and a set of positions on the prefix array as inputs and counts the number of characters existing on the original series data by using the fact that the prefix array is ordered. When counting, whether or not to perform a binary search is switched depending on the frequency.
  • each unit of the frequent sequence listing device 20A may be realized by using a combination of hardware and software.
  • the enumeration program is expanded in RAM (random access memory), and each unit is realized as various means by operating hardware such as a control unit (CPU) based on the program. Further, the program may be recorded on a recording medium and distributed.
  • the program recorded on the recording medium is read into a memory via a wired connection, a wireless connection, or the recording medium itself, and operates the control unit and the like.
  • examples of the recording medium include an optical disk, a magnetic disk, a semiconductor memory device, and a hard disk.
  • an information processing device (computer) that operates as the frequent sequence enumeration device 20A can be realized by operating, based on the enumeration program, as the PA creation unit 21A, the PA storage unit 22A, the PA frequency calculation unit 23A, the PA mapping unit 24A, and the control unit 25A.
  • DB mapping and frequency calculation processing are performed using a prefix array and its reverse array. If a prefix array is used, a list of prefixes existing to the right of a certain character can be efficiently extracted, so there is no need to copy repeated data. Moreover, since the prefixes are always arranged in the dictionary order, the frequency can be calculated efficiently using this property.
  • the “suffix array” and “prefix array” are collectively referred to as “sorted array”.
  • the frequent sequence enumeration apparatus that represents the first embodiment and the second embodiment as a high-level concept is not shown, but it consists of a sorted array creation unit, a sorted array storage unit, a sorted array frequency calculation unit, a sorted array mapping unit, and a control unit.
  • the sorted array storage unit stores a sorted array that makes it possible to refer to the position on the original series data from the position in dictionary order when all suffixes or all prefixes of the series data are arranged in dictionary order.
  • the sorted array frequency calculation unit receives the sorted array and a set of positions on the sorted array as input, and counts the number of characters existing in the original series data by using the fact that the sorted array is ordered.
  • the sorted array mapping unit takes a set of positions in the sorted array and a specific character as input, adds to or subtracts from the value in the sorted array, refers to the appearance positions of the suffixes or prefixes adjacent to the input character, and narrows down the set of target positions.
  • the sorted array frequency calculation unit takes the sorted array and a set of positions on the sorted array as inputs and counts the number of characters existing in the original series data by using the fact that the sorted array is ordered. When counting, whether or not to perform a binary search is switched depending on the frequency.
  • the SA frequency calculation unit 23 performs a binary search process and calculates the frequency of each character (frequency calculation process f2 of FIG. 5).
  • the target suffix list P is 0 to 23
  • the target series TS is an empty character string “”.
  • the result is shown in FIG. In FIG. 12, the frequency and the start position and end position on all SAs are output for the characters A, B, and C, respectively.
  • the SA frequency calculation unit 23 outputs 0 to 23 for the target suffix list P, the empty character string "" for the target series TS, and the data of FIG.
  • the control unit 25 examines the data in FIG. 12, outputs the characters A, B, and C (output determination processing f3 in FIG. 5), and adds them to the stack (step f4 in FIG. 5). The control unit 25 further extracts A at the top of the stack (step f5 in FIG. 5), and inputs the target suffix list P (0 to 23) and the sequence (A) to the SA mapping unit 24.
  • T[SA[10]] = A.
  • {6, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 23} is obtained as the target suffix list based on the target sequence A and is output to the SA storage unit 22.
  • FIG. 13 shows a pseudo partial suffix arrangement diagram in which only these lines are extracted.
  • the SA frequency calculation unit 23 performs frequency calculation based on these data (frequency calculation processing f2 in FIG. 5). The result is shown in FIG.
  • the control unit 25 combines these with the target series A and outputs AA, AB, and AC (output determination process f3 in FIG. 5). Further, the control unit 25 adds these to the stack {C, B}, making the stack {C, B, AC, AB, AA}. Then, the control unit 25 takes out the uppermost series AA and performs the same process again. That is, the target suffix list of FIG. 15 is obtained by the next SA mapping.
  • the control unit 25 does not output anything and performs processing for the next sequence AB in the stack.
  • the result of SA mapping is as shown in FIG. 16, and outputs of ABB 3 times and ABC 4 times are obtained.
  • the enumeration apparatus 20 performs such recursive calculation using a suffix array instead of a DB map.
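The overall loop (f2 to f6) can be put together in one compact sketch. Everything here is a hedged illustration: the function names are hypothetical, the suffix array is built by naive sorting, frequencies count character occurrences rather than records, and the mapped suffix list is computed when a sequence is pushed onto the stack rather than when it is popped, which simplifies the bookkeeping but differs from the order of steps in the text.

```python
DELIM, END = "%", "$"

def build_index(records):
    T = DELIM.join(records) + DELIM + END
    SA = sorted(range(len(T)), key=lambda i: T[i:])
    INV = [0] * len(T)
    for rank, pos in enumerate(SA):
        INV[pos] = rank
    return T, SA, INV

def runs(T, SA, P):
    """Yield (char, start, end) for each run of equal first characters in sorted P."""
    s = 0
    while s < len(P):
        c = T[SA[P[s]]]
        l, u = s, len(P)
        while l + 1 < u:                       # binary search for the end of the run
            m = (l + u) // 2
            if T[SA[P[m]]] <= c:
                l = m
            else:
                u = m
        yield c, s, l
        s = l + 1

def sa_map(T, SA, INV, P, s, e):
    """Suffixes to the right of P[s..e] within the same record, as sorted SA subscripts."""
    reached = set()
    for k in range(s, e + 1):
        n = INV[SA[P[k]] + 1]
        while T[SA[n]] not in (DELIM, END):
            reached.add(n)
            n = INV[SA[n] + 1]
    return sorted(reached)

def enumerate_frequent(records, min_freq):
    T, SA, INV = build_index(records)
    found = {}
    stack = [("", list(range(len(T))))]        # (target sequence TS, target suffix list P)
    while stack:
        ts, P = stack.pop()
        for c, s, e in runs(T, SA, P):
            freq = e - s + 1
            if c in (DELIM, END) or freq < min_freq:
                continue
            found[ts + c] = freq               # output the extended sequence
            stack.append((ts + c, sa_map(T, SA, INV, P, s, e)))
    return found

print(enumerate_frequent(["ACD", "ABC", "CBA", "AAB"], 2))
```

For the series DB {ACD, ABC, CBA, AAB} and a frequency threshold of 2, the sketch reports A (5), B (3), C (3), AB (2), and AC (2), matching the PrefixSpan walk-through earlier in the description.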
  • the calculation time of the mapping itself is proportional to the data length, as in the conventional case, but since the frequency calculation after mapping can be performed by a binary search, the apparatus operates efficiently. While the present invention has been described with reference to the embodiments, the present invention is not limited to the above embodiments.
  • a sorted array storage unit that stores a sorted array that allows reference to a position on the original series data from a position in dictionary order when all suffixes or prefixes of the series data are arranged in dictionary order;
  • a sorted array frequency calculation unit that takes the sorted array and a set of positions on the sorted array as input, and counts the number of characters existing on the original series data by using the fact that the sorted array is ordered; and
  • a sorted array mapping unit that takes a set of positions in the sorted array and a specific character as input, adds to or subtracts from the value in the sorted array, refers to the appearance position of a suffix or prefix adjacent to the input character, and narrows down the set of target positions;
  • the sorted array frequency calculation unit uses the sorted array and a set of positions on the sorted array as inputs, and the sorted array orders the number of characters existing on the original series data.
  • the sorted array consists of a suffix array,
  • the apparatus further comprises a suffix array creation unit that generates the suffix array, the reverse array of the suffix array, and the original series data while inserting the series data delimiter using the series data set as an input.
  • the frequent sequence listing apparatus according to 1 or 2.
  • the sorted array storage unit is a suffix array storage unit that stores the suffix array, the reverse array of the suffix array, and the original series data generated by the suffix array generation unit.
  • the frequent-sequence enumeration apparatus according to appendix 3, comprising: (Supplementary note 5)
  • the sorted array consists of a prefix array,
  • the system further comprises: a prefix array creating unit that generates the prefix array, the reverse array of the prefix array, and the original series data while receiving the series data set as an input and inserting a break of the series data
  • the frequent sequence listing apparatus according to 1 or 2.
  • the sorted array storage unit is a prefix array storage unit that stores the prefix array, the reverse array of the prefix array, and the original series data generated by the prefix array creation unit.
  • the frequent-sequence enumeration apparatus comprising:
  • (Supplementary note 7) A method of enumerating high-frequency series from within a data string using an enumeration device, comprising: a storage step of storing, in a sorted array storage unit, a sorted array that enables reference to a position on the original series data from a position in dictionary order when all suffixes or prefixes of the series data are arranged in dictionary order; a calculation step of counting the number of characters existing on the original series data by using the fact that the sorted array is ordered, with the sorted array and a set of positions on the sorted array as inputs; and a step of taking a set of positions in the sorted array and a specific character as input, adding to or subtracting from the value in the sorted array, and referring to the appearance
  • with the sorted array and a set of positions on the sorted array as inputs, whether or not to perform a binary search is switched according to frequency when counting the number of characters existing on the original series data by using the fact that the sorted array is ordered: the frequent sequence enumeration method according to Supplementary note 7. (Supplementary note 9)
  • the sorted array consists of a suffix array, and the method according to Supplementary Note 7 or 8 further comprises a creation step of taking the series data set as input and generating the suffix array, the inverse array of the suffix array, and the original series data while inserting delimiters between the series data
  • the suffix array generated in the creation step is stored in the suffix array storage unit, which serves as the sorted array storage unit.
  • the frequent sequence enumeration method according to Supplementary Note 9. (Supplementary Note 11)
  • the sorted array consists of a prefix array, and the method according to Supplementary Note 7 or 8 further comprises a creation step of taking the series data set as input and generating the prefix array, the inverse array of the prefix array, and the original series data while inserting delimiters between the series data.
  • the prefix array, the inverse array of the prefix array, and the original series data generated in the creation step are stored in the prefix array storage unit, which serves as the sorted array storage unit, in the frequent sequence enumeration method according to Supplementary Note 11.
  • a computer-readable recording medium recording a program for causing a computer to enumerate high-frequency sequences from a data string, wherein the program causes the computer to execute: a storage procedure of storing, in a sorted array storage unit, a sorted array that makes it possible to refer from a position in the dictionary order to a position on the original series data when all suffixes or all prefixes of the series data are arranged in dictionary order; a calculation procedure of taking the sorted array and a set of positions on the sorted array as input and counting the characters present in the original series data by exploiting the fact that the sorted array is ordered; and a mapping procedure of taking a set of positions in the sorted array and a specific character as input, adding to or subtracting from the values in the sorted array, referring to the positions at which the suffixes or prefixes adjacent to the input character appear, and narrowing down the set of target positions. (Supplementary Note 14) In the calculation procedure, the sorted array orders the number of characters existing on the original series
  • the computer-readable recording medium according to Supplementary Note 13, wherein, when counting is performed, whether or not to perform a binary search is switched according to the frequency.
  • the sorted array consists of a suffix array
  • the program causes the computer to execute a creation procedure of taking the series data set as input and generating the suffix array, the inverse array of the suffix array, and the original series data while inserting delimiters between the series data.
  • the prefix array storage unit that is the sorted array storage unit, the prefix array generated in the creation step, the reverse array of the prefix array, 18.
  • the present invention can be used to quickly compute characteristic patterns that appear frequently in the analysis of purchase logs, DNA, text data, and log data.

Abstract

In order to efficiently carry out a frequency calculation process, a frequent sequence enumeration device comprises: a suffix array storage unit which, when all suffixes for sequence data are arranged in dictionary order, stores a suffix array whereby it is possible to reference a location on the sequence data from a location on a dictionary; a suffix array frequency calculation unit which, taking a set of locations on the suffix array as input, counts the number of times a character is included in the suffixes referenced by the suffix array, using the fact that the suffixes are sequenced; and a suffix array mapping unit which, taking a set of locations in the suffix array and a specified character as input, returns an appearance location on the suffix array with respect to the inputted character by adding one to the value within the suffix array and referencing an inverted array of the suffix array.

Description

Frequent sequence enumeration apparatus, method, and recording medium
The present invention relates to an apparatus for enumerating high-frequency sequences from a data string such as text data, purchase data, or DNA (deoxyribonucleic acid) data.
A frequent sequence enumeration apparatus is known as a technique for discovering patterns useful to users in a large-scale database. Here, a “frequent sequence enumeration apparatus” is an apparatus that enumerates high-frequency subsequences from a database (sequence database) composed of data in which order is meaningful (sequence data), such as purchase data, DNA data, and tests. “Sequence data” is an ordered string of characters. Here, a “character” is a symbol drawn from a symbol set of arbitrary size.
Thus, for example, an item (product) may be treated as a character in the case of purchase data, and a single word may be treated as a character in the case of text data.
A “subsequence” is a sequence of characters that appears in the sequence data with its order preserved. For example, the subsequences of the four-character sequence data ABCD are:
AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, and ABCD.
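The list of subsequences above can be reproduced with a short sketch. It assumes Python and the standard `itertools` module, and the function name `subsequences` is illustrative only. Like the list in the text, it reports only subsequences of two or more characters.

```python
from itertools import combinations

def subsequences(sequence, min_length=2):
    """Return the order-preserving subsequences of `sequence`, shortest first."""
    found = []
    for length in range(min_length, len(sequence) + 1):
        # combinations() picks elements in their original order,
        # so every result keeps the order of the input sequence.
        for picked in combinations(sequence, length):
            found.append("".join(picked))
    return found

print(subsequences("ABCD"))
```

For an input with repeated characters, this sketch would list the same subsequence more than once; it is meant only to illustrate the definition.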
A method using PrefixSpan is known as a way to enumerate frequent sequences (see Non-Patent Document 1). PrefixSpan is an algorithm that alternates between frequency calculation and a process called database mapping. Hereinafter, “database” is abbreviated as DB.
FIG. 1 is a block diagram showing the configuration of an enumeration apparatus 10 that uses PrefixSpan. The enumeration apparatus 10 includes a sequence DB management unit 11, a frequency calculation unit 12, a control unit 13, and a DB mapping unit 14.
For example, assume that the sequence DB in the sequence DB management unit 11 holds the four sequences {ACD, ABC, CBA, AAB} and that 2 is specified as the frequency threshold. In this case, the enumeration apparatus 10 first performs frequency calculation using the frequency calculation unit 12. This process counts the frequency of each character in the sequence DB. In this example, A appears 5 times, B 3 times, C 3 times, and D once. Because A, B, and C each have a frequency of 2 or more, the control unit 13 outputs the three sequences A, B, and C. The control unit 13 then uses the DB mapping unit 14 to perform DB mapping on the initial sequence DB with each character whose frequency is 2 or more.
“DB mapping using the character x” is a process that takes the records containing the character x from the sequence DB and extracts from each the character string that follows x (a suffix).
For example, performing DB mapping using the character A on the sequence DB {ACD, ABC, CBA, AAB} yields a new sequence DB consisting of the three sequences {CD, BC, AB}.
The enumeration apparatus 10 then performs frequency calculation on this new sequence DB and obtains the result that A appears once, B twice, and C twice. The control unit 13 therefore outputs AB and AC, each of which appears twice. This process is carried out recursively as a depth-first search, and all sequences with a frequency of 2 or more in the original sequence DB are enumerated.
In this way, PrefixSpan proceeds by repeating frequency calculation and the process called DB mapping.
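The walk-through above can be sketched in a few lines of Python. This is a minimal illustration of the procedure as described here, not the published PrefixSpan implementation: the names `project` and `prefixspan` are assumptions, and frequencies are counted as character occurrences, as in the example in the text, whereas PrefixSpan as usually published counts the number of sequences that contain a pattern.

```python
from collections import Counter

def project(db, x):
    """DB mapping: keep records containing x, cut down to the text after the first x."""
    projected = []
    for record in db:
        i = record.find(x)
        if i != -1 and record[i + 1:]:
            projected.append(record[i + 1:])
    return projected

def prefixspan(db, min_freq, prefix=""):
    """Depth-first enumeration of sequences whose frequency is at least min_freq."""
    counts = Counter(ch for record in db for ch in record)  # frequency calculation
    frequent = {}
    for ch in sorted(counts):
        if counts[ch] >= min_freq:
            frequent[prefix + ch] = counts[ch]
            # Recurse on the projected DB for the extended prefix.
            frequent.update(prefixspan(project(db, ch), min_freq, prefix + ch))
    return frequent

db = ["ACD", "ABC", "CBA", "AAB"]
print(project(db, "A"))      # the projected DB for the character A
print(prefixspan(db, 2))     # all sequences with frequency 2 or more
```

Each recursive call copies the projected records and recounts them, which is the repeated cost that the text goes on to identify as the problem.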
Non-Patent Document 2 discloses the suffix array.
Prior art documents related to the present invention are also known. For example, Patent Document 1 discloses a “sequence pattern extraction apparatus and method” that can make projection processing more efficient. The sequence pattern extraction apparatus disclosed in Patent Document 1 stores, in association with each item constituting the sequence data, a projection position that identifies the item to be extracted at projection time. This lets it obtain the position to project efficiently and perform the projection processing efficiently.
JP 2009-169850 A
However, the first problem with the enumeration apparatus 10 disclosed in Non-Patent Document 1 is its long processing time. In the conventional method, frequency calculation is performed repeatedly after each DB mapping, so the processing cost grows as the amount of sequence data increases.
Non-Patent Document 2 merely discloses the suffix array.
Patent Document 1, for its part, merely discloses a technique for storing, in association with each item, a projection position that identifies the item to be extracted at projection time.
[Object of the invention]
An object of the present invention is to realize a high-speed frequent sequence enumeration apparatus that efficiently performs frequency calculation processing.
The frequent sequence enumeration apparatus according to the present invention comprises: a sorted array storage unit that stores a sorted array which, when all suffixes or all prefixes of the sequence data are arranged in dictionary order, makes it possible to refer from a position in the dictionary order to a position on the original sequence data; a sorted array frequency calculation unit that takes the sorted array and a set of positions on the sorted array as input and counts the characters present in the original sequence data by exploiting the fact that the sorted array is ordered; and a sorted array mapping unit that takes a set of positions in the sorted array and a specific character as input, adds to or subtracts from the values in the sorted array, refers to the positions at which the suffixes or prefixes adjacent to the input character appear, and narrows down the set of target positions.
According to the present invention, the frequency calculation process can be performed efficiently.
FIG. 1 is a block diagram showing the structure of a related enumeration apparatus disclosed in Non-Patent Document 1.
FIG. 2 is a block diagram showing the configuration of the frequent sequence enumeration apparatus according to the first embodiment of the present invention.
FIG. 3 is a diagram for explaining the suffix array.
FIG. 4 is a diagram showing an example of a suffix array.
FIG. 5 is a diagram showing the processing flow of the frequent sequence enumeration apparatus shown in FIG. 2.
FIG. 6 is a diagram showing the processing flow of the frequency calculation process f2 in FIG. 5.
FIG. 7 is a diagram showing the processing flow of the binary search process c03 in FIG. 6.
FIG. 8 is a diagram showing the flow of the SA mapping process f6 in FIG. 5.
FIG. 9 is a diagram showing the processing flow of the suffix extension in FIG. 8.
FIG. 10 is a block diagram showing the configuration of the frequent sequence enumeration apparatus according to the second embodiment of the present invention.
FIG. 11 is a diagram showing another example of the suffix array.
FIG. 12 is a diagram illustrating an example of the output of the frequency calculation unit.
FIG. 13 is a diagram schematically illustrating a target suffix list obtained by SA mapping of the target sequence A.
FIG. 14 is a diagram illustrating another example of the output of the frequency calculation unit.
FIG. 15 is a diagram schematically showing a target suffix list obtained by SA mapping of the target sequence AA.
FIG. 16 is a diagram schematically illustrating a target suffix list obtained by SA mapping of the target sequence AB.
To facilitate understanding of the present invention, the suffix array is described first. The “suffix array” is the data structure described in Non-Patent Document 2.
A suffix array is an array that holds the positions of all suffixes appearing in a character string when those suffixes are arranged in dictionary order.
FIG. 3 shows an example of the suffix array for the character string abraca. The leftmost list in FIG. 3 shows all suffixes of the character string abraca. In this figure, for convenience, the character $, which sorts first in dictionary order, is appended to the end. Each suffix is assigned a symbol S_i, which denotes the suffix that starts at the i-th character. The list in the center of FIG. 3 is the left list rearranged in dictionary order. The list of integers on the right of FIG. 3 consists of the subscripts taken from the central suffix list, and this list is the suffix array.
The suffix array indicates where in the original character string the suffix at any given position in dictionary order is located. It is a data structure used mainly for document search, and it is known that an arbitrary substring can be located in time on the order of log(string length). Hereinafter, the suffix array may be abbreviated simply as SA.
In addition, by holding the inverse array of the suffix array, the suffix located one character to the right of a given suffix can be retrieved through the suffix array. The inverse array V of an array A is the array that satisfies
A[V[i]] = i.
Let the suffix array be SA[i] and its inverse array be INV[i].
Then SA[i] indicates the character position in the original character string at which the suffix ranked i-th in dictionary order begins.
The inverse array INV[i] indicates the rank in dictionary order of the suffix that begins at the i-th position of the original character string.
Using these two arrays, the suffix one character to the right of the i-th suffix in the suffix array can be found by computing
INV[SA[i]+1].
For example, in the suffix array of FIG. 3, SA[2] (array subscripts start at 0), which corresponds to abraca$, the third suffix from the top in dictionary order, is 1. Adding 1 to this value gives S_2, and looking up the position of S_2 in the inverse array gives 4, which is the position of braca$, the suffix one character to the right.
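The lookup INV[SA[i]+1] can be tried on the same string with a small sketch. It assumes Python, builds the suffix array naively by sorting, and numbers character positions from 0, so the value that the text gives as 1 for abraca$ appears here as 0. The variable names are illustrative.

```python
text = "abraca$"  # '$' sorts before the letters in ASCII

# Naive construction: sort the start positions by the suffix that begins there.
sa = sorted(range(len(text)), key=lambda i: text[i:])

# Inverse array: inv[pos] is the dictionary-order rank of the suffix starting at pos,
# so sa[inv[pos]] == pos for every position.
inv = [0] * len(sa)
for r, pos in enumerate(sa):
    inv[pos] = r

rank = 2                      # "abraca$" is third in dictionary order
right = inv[sa[rank] + 1]     # rank of the suffix one character to the right
print(sa)
print(text[sa[rank]:], "->", text[sa[right]:], "at rank", right)
```

The second line of output shows abraca$ mapped to braca$ at rank 4, matching the position given in the text.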
The suffix array also has the property that, for a group of suffixes sharing the same prefix, the suffixes one character to their right appear in the same relative order in the suffix array. As an example, FIG. 4 shows the suffix array for the character string abracadabra.
In this set of suffixes, the suffixes that start with the same prefix a are the 1st to 5th suffixes, and they begin at the 10th, 7th, 0th, 3rd, and 5th positions of the original character string. The suffixes one character to the right of these begin at the 11th, 8th, 1st, 4th, and 6th positions of the original character string. These appear 0th, 6th, 7th, 8th, and 9th in dictionary order, respectively, which shows that the five suffixes are arranged in the same order in the dictionary order.
A suffix array is the result of sorting all suffixes in dictionary order, and suffixes that share a prefix are ordered by the characters that follow that prefix. In this example the leading a is common, so the suffixes are compared from the second character onward, and this order is therefore always preserved.
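The order-preservation property can be checked on the abracadabra example with the following sketch. It assumes Python and a naive sort-based construction; the helper names `suffix_array` and `inverse` are illustrative.

```python
def suffix_array(text):
    """Start positions of all suffixes, sorted by dictionary order of the suffix."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def inverse(sa):
    """Inverse array: maps a start position to its rank in the suffix array."""
    inv = [0] * len(sa)
    for rank, pos in enumerate(sa):
        inv[pos] = rank
    return inv

text = "abracadabra$"
sa = suffix_array(text)
inv = inverse(sa)

a_ranks = [r for r in range(len(sa)) if text[sa[r]] == "a"]   # suffixes starting with a
a_positions = [sa[r] for r in a_ranks]                        # where they start in the text
right_ranks = [inv[pos + 1] for pos in a_positions]           # one character to the right

print(a_ranks, a_positions, right_ranks)
```

The ranks of the shifted suffixes come out in increasing order, which is what allows a mapped set of positions to stay sorted without being sorted again.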
Hereinafter, embodiments of the present invention will be described in detail.
[First Embodiment]
Referring to FIG. 2, the frequent sequence enumeration apparatus 20 according to the first embodiment of the present invention comprises an SA creation unit 21, an SA storage unit 22, an SA frequency calculation unit 23, an SA mapping unit 24, and a control unit 25.
The SA creation unit 21 takes a set of sequence data as input and generates a suffix array, the inverse array of the suffix array, and the original sequence data while inserting delimiters between the sequence data.
The SA storage unit 22 stores the suffix array, the inverse array of the suffix array, and the original sequence data created by the SA creation unit 21.
The SA frequency calculation unit 23 refers to the data in the SA storage unit 22, performs frequency calculation by exploiting the fact that the data is arranged in dictionary order, and passes the result to the control unit 25.
The SA mapping unit 24 calculates the pointers of the suffixes to be referred to next, based on the pointers specified by the control unit 25.
The SA storage unit 22 stores a suffix array that, when all suffixes of the sequence data are arranged in dictionary order, makes it possible to refer from a position in the dictionary order to a position on the sequence data.
The SA frequency calculation unit 23 takes a set of positions on the suffix array as input and counts the characters contained in the suffixes referenced by the suffix array, exploiting the fact that the suffixes are ordered.
The SA mapping unit 24 takes a set of positions in the suffix array and a specific character as input and returns the positions on the suffix array at which the input character appears, by adding 1 to the values in the suffix array and referring to the inverse array of the suffix array.
In addition, when the SA frequency calculation unit 23 takes the suffix array and a set of positions on the suffix array as input and counts the characters present in the original sequence data by exploiting the ordering of the suffix array, it switches between using and not using a binary search depending on the frequency.
Next, the processing flow of the frequent sequence enumeration apparatus 20 of the first embodiment is described with reference to FIG. 5.
This processing flow starts when a set of sequence data is input to the SA creation unit 21.
The SA creation unit 21 first constructs a suffix array from the set of sequence data (step f1).
This process is performed by the following procedure.
The SA creation unit 21 first creates a single character string by concatenating the sequence data while inserting a delimiter character.
For example, for the four pieces of sequence data
{CAABC, ABCB, CABC, ABBCA},
the character string created with “%” as the delimiter character and “$” as the terminal character is
“CAABC%ABCB%CABC%ABBCA%$”.
Next, the SA creation unit 21 constructs the suffix array SA from the created character string. The SA creation unit 21 then constructs the inverse array INV from the suffix array SA.
The SA creation unit 21 stores three items in the SA storage unit 22: the concatenated character string T, the suffix array SA, and the inverse array INV.
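Step f1 can be sketched as follows. This assumes Python, and the function name `build_index` and its parameter names are illustrative. The suffix array is built by a naive sort, which is enough to show the data layout; the text does not say which construction algorithm the SA creation unit 21 uses.

```python
def build_index(sequences, delimiter="%", terminator="$"):
    """Concatenate the sequences and return (T, SA, INV) as described for step f1."""
    # A delimiter follows every sequence, and a terminal character ends the string.
    text = "".join(seq + delimiter for seq in sequences) + terminator
    # Naive suffix array: sort the start positions by the suffix beginning there.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Inverse array: inv[pos] is the rank of the suffix that starts at pos.
    inv = [0] * len(sa)
    for rank, pos in enumerate(sa):
        inv[pos] = rank
    return text, sa, inv

T, SA, INV = build_index(["CAABC", "ABCB", "CABC", "ABBCA"])
print(T)
print(SA[:5])   # the terminal character and the four delimiters sort first
```

In ASCII, `$` sorts before `%`, and both sort before the letters, so the terminal character and the delimiters occupy the first ranks of the suffix array.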
Once the three pieces of information are stored in the SA storage unit 22, the SA frequency calculation unit 23 refers to SA and performs the frequency calculation process (step f2).
This process is shown in FIG. 6.
The frequency calculation process takes as input the suffix array SA, the character string T, the target suffix list P, and the target sequence TS. Of these, the target suffix list P is a list of subscripts into the suffix array SA. The target sequence TS is the sequence on which the SA mapping in effect at that point is based.
On the first execution, there is no sequence on which a mapping is based and all suffixes are targeted, so the inputs are a list of ids from 0 to the full length of the character string and the empty character string "", respectively.
On every execution other than the first, the values that the SA mapping unit 24 saved in the SA storage unit 22 are used.
When this process starts, the SA frequency calculation unit 23 first initializes a variable s, which represents a pointer into P, to 0 (initialization process c01).
Next, the SA frequency calculation unit 23 extracts the character corresponding to the s-th subscript of P as T[SA[P[s]]] and sets it as c (character extraction process c02).
The SA frequency calculation unit 23 then uses a binary search to find the position in P at which the first character is no longer c, and sets it as e (binary search process c03).
The binary search process c03 is performed by the process of FIG. 7.
The binary search process c03 receives the character c and the target suffix list P as input, and executes the final position where the first character is c in the suffix in P by the binary search algorithm.
In this process, first, the search range of the binary search is started with a length of 0 to P. Therefore, the SA frequency calculation unit 23 sets the pointers l and u representing the ranges to the lengths of 0 and P, respectively (step b01). .
Next, the SA frequency calculation unit 23 calculates a position m that is in the middle of the range (step b02), extracts a character at that position by T [SA [P [m]]], and sets it to t (step b02). b03).
Then, the SA frequency calculation unit 23 performs a comparison process comp (c, t) between c and t (step b04).
The comparison process comp (a, b) compares the character a and the character b in the dictionary order, 0 if the character a and the character b match, 1 if the character a <character b in the dictionary order, 1 in the dictionary order If character a> character b, the process returns -1.
When the result of comp(c, t) is 0 or more, the position at which the first character is no longer c lies after m, so the SA frequency calculation unit 23 substitutes m for the start position l to continue with the latter half of the search range (step b05).
Otherwise, the SA frequency calculation unit 23 substitutes m for the end position u to continue with the first half of the search range (step b06).
Next, the SA frequency calculation unit 23 performs an end determination (step b07). It checks whether l + 1 equals u, that is, whether the search range has been narrowed down to one line. If so, it outputs l as the last position where the character c appears; otherwise, it returns to step b02 and repeats the same processing.
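Steps b01 to b07 can be sketched in Python as follows. This is a minimal illustration, not the implementation of the embodiment: the function names are hypothetical, the sign convention of `comp` is assumed to be the one step b05 relies on (non-negative when c does not sort before t), and the loop tests `l + 1 < u` before probing so that a one-element list terminates immediately.

```python
def comp(a: str, b: str) -> int:
    """Three-way comparison: 0 if equal, 1 if a sorts after b, -1 otherwise.
    The sign convention is an assumption chosen to match step b05."""
    return (a > b) - (a < b)


def last_position_of(c: str, T: str, SA: list, P: list) -> int:
    """Return the last index in P whose suffix begins with c.

    Assumes P is in ascending order of SA subscripts and that the suffix
    at P[0] does not begin with a character greater than c.
    """
    l, u = 0, len(P)                 # step b01
    while l + 1 < u:                 # step b07 (end determination)
        m = (l + u) // 2             # step b02
        t = T[SA[P[m]]]              # step b03
        if comp(c, t) >= 0:          # step b04
            l = m                    # step b05: continue in the latter half
        else:
            u = m                    # step b06: continue in the first half
    return l
```

For the toy string "ABAC$", whose suffix array is [4, 0, 2, 1, 3], the suffixes beginning with A occupy positions 1 and 2, so the function returns 2 for c = "A".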
Returning to FIG. 6, the suffixes beginning with the character c are now known to occupy positions s to e, so the SA frequency calculation unit 23 calculates the frequency of the character c as (e - s) and outputs the character c, the frequency (e - s), the start position s, and the end position e (output process c04).
If the character c is % or $, nothing is output.
The SA frequency calculation unit 23 then sets s to e in order to calculate the frequency of the next character (processing c05).
Finally, the SA frequency calculation unit 23 determines whether or not the pointer s has reached the end of P (determination process c06).
That is, if s = P.length, the SA frequency calculation unit 23 determines that the end has been reached and ends the process.
Otherwise, the SA frequency calculation unit 23 returns to the process c02 and performs the same process.
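The loop c01 to c06 can be sketched as below. This is an illustrative sketch under stated assumptions: the function name and the tuple layout are hypothetical, e is treated as an exclusive end (one past the last suffix beginning with c) so that the frequency is e - s, and the binary search starts from s rather than 0, which gives the same result because P is sorted.

```python
def char_frequencies(T: str, SA: list, P: list, skip: str = "%$") -> list:
    """Return (character, frequency, start, end) for each run of suffixes
    in P that share a first character; end is exclusive.

    Assumes P holds SA subscripts in ascending order.
    """
    out = []
    s = 0                                    # c01: pointer on P
    while s < len(P):                        # c06: stop at the end of P
        c = T[SA[P[s]]]                      # c02: first character at P[s]
        l, u = s, len(P)                     # c03: binary search for the run end
        while l + 1 < u:
            m = (l + u) // 2
            if T[SA[P[m]]] <= c:
                l = m
            else:
                u = m
        e = l + 1
        if c not in skip:                    # delimiter and terminator are not output
            out.append((c, e - s, s, e))     # c04
        s = e                                # c05: move on to the next character
    return out
```

With T = "ABAC$" and P covering the whole suffix array, the result lists A with frequency 2 and B and C with frequency 1 each.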
Note that the frequency calculation normally requires time on the order of O(P.length) in the length of the target suffix list P, but because this algorithm uses a binary search, it can be carried out in O(W · log(P.length)) time, where W is the number of character types.
However, when the target suffix list P has fewer entries than the number of character types W, examining the characters in order from the top may be faster.
For this reason, the SA frequency calculation unit 23 may set a threshold for P.length in advance and, when P.length is below the threshold, simply count the frequencies by scanning.
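The switch between a plain scan and the binary search might look like the following sketch. The function name, the default threshold of 8, and the representation of the input as the sequence of first characters (already in sorted order) are all assumptions made for illustration; the source does not specify a threshold value.

```python
from bisect import bisect_right
from itertools import groupby


def count_runs(first_chars, threshold: int = 8) -> list:
    """Count each character in a sorted sequence of first characters.

    Short inputs are scanned linearly; longer ones use a binary search
    per distinct character, i.e. O(W log n) comparisons for W characters.
    """
    n = len(first_chars)
    if n < threshold:
        # Linear scan: cheap when the list is shorter than the alphabet.
        return [(c, len(list(group))) for c, group in groupby(first_chars)]
    runs, s = [], 0
    while s < n:
        c = first_chars[s]
        e = bisect_right(first_chars, c, s, n)   # exclusive end of the run of c
        runs.append((c, e - s))
        s = e
    return runs
```

Both branches return the same counts; only the cost differs, so the threshold can be tuned empirically.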
Returning to FIG. 5, the SA frequency calculation unit 23 inputs the resulting list of characters, frequencies, start positions, and end positions, together with the target suffix list P and the target sequence TS, to the control unit 25.
Next, the control unit 25 performs an output determination process f3 that determines, from the frequency list of the characters, whether any character has a frequency equal to or higher than a preset frequency threshold.
When such a character is included, the control unit 25 outputs the sequence obtained by appending the character to the target sequence TS and pushes that sequence onto the stack (processing f4).
Then, the control unit 25 takes out the top series of the stack (processing f5).
At this time, if the stack is empty, the processing is terminated.
When a sequence is obtained from the stack, the control unit 25 inputs the target suffix list and the sequence to the SA mapping unit 24.
When the sequence S and the target suffix list are input to the SA mapping unit 24, the SA mapping unit 24 performs DB mapping processing (processing f6).
Taking the sequence S and the target suffix list as input, the SA mapping unit 24 refers to the suffix array data in the SA storage unit 22, builds the list of suffixes that appear after the last character c of the sequence S, and updates the target suffix list P.
This process will be described with reference to FIG.
First, the SA mapping unit 24 creates an inverse array P_INV of the target suffix list P (step S11).
The inverse array is computed for the target suffix list by the following formula.
P_INV [P [i]] = i
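As a small illustration of the formula P_INV[P[i]] = i, the following sketch builds the inverse array as a Python list. The function name and the use of -1 for SA subscripts that do not occur in P are assumptions for illustration only.

```python
def inverse_of(P: list, size: int) -> list:
    """Build P_INV such that P_INV[P[i]] == i.

    size is the length of the suffix array; subscripts absent from P
    are marked with -1 (an illustrative sentinel).
    """
    P_INV = [-1] * size
    for i, p in enumerate(P):
        P_INV[p] = i
    return P_INV
```

When P is the identity list 0, 1, 2, ..., the inverse array is the same list.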
Next, the SA mapping unit 24 expands the suffix array for the suffixes lying between the start position and the end position, in the target suffix list P, of the suffixes that begin with the last character c of the sequence S (step S12).
This process will be described with reference to FIG.
This process is performed using the start position s and the end position e of the character c, the target suffix list P, and its inverse array P_INV.
The SA mapping unit 24 performs the following process for each suffix i (s <= i <e) starting with c.
1. i is substituted into the pointer k (step S21).
2. The position n, on the entire suffix array, of the suffix one character to the right is calculated by n = INV[SA[P[k]] + 1] (step S22).
3. The character at that position is extracted as T[SA[n]], and it is determined whether it is the delimiter character "%" (step S23).
4. If so, the process for this suffix ends (step S24).
5. Otherwise, the position on the target suffix list is calculated by k = P_INV[INV[SA[P[k]] + 1]] (step S25), the position is output, and the process returns to step S22.
With the above processing, all the next positions of the character string starting with the character c in the current target suffix list can be extracted.
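Steps S21 to S25 can be sketched as follows. This is a hedged illustration, not the embodiment's code: the function name is hypothetical, P_INV is held as a dictionary, the output is the SA subscript n of each suffix reached, and duplicates are removed and the result sorted so that it can serve directly as the next target suffix list (an assumption consistent with the sorted example list given later).

```python
def expand_right(T: str, SA: list, INV: list, P: list,
                 s: int, e: int, delimiter: str = "%") -> list:
    """Collect the SA subscripts of every suffix that starts to the right of
    a suffix P[i] (s <= i < e) within the same sequence.

    Assumes P is closed under "one step to the right" (true for the initial
    list and for lists produced by this function) and that T ends with the
    delimiter followed by a terminator.
    """
    P_INV = {p: i for i, p in enumerate(P)}   # step S11: inverse of P
    out = []
    for i in range(s, e):
        k = i                                 # step S21
        while True:
            n = INV[SA[P[k]] + 1]             # step S22: suffix one character right
            if T[SA[n]] == delimiter:         # step S23
                break                         # step S24: end of this sequence
            out.append(n)                     # step S25: output the position
            k = P_INV[n]
    return sorted(set(out))
```

For T = "ABAC%$" the suffixes beginning with A occupy positions 2 and 3 of the suffix array, and expanding them yields positions 3, 4, and 5, that is, the suffixes beginning at the second A, the B, and the C.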
Returning to FIG. 8, the SA mapping unit 24 takes the positions obtained by this processing as the new target suffix list and stores it in the SA storage unit 22. The SA mapping unit 24 also stores the sequence S as the target sequence (step S13).
Returning to FIG. 5, once the target suffix list in the SA storage unit 22 has been updated, the SA frequency calculation unit 23 performs the frequency calculation based on the new target suffix list P and the target sequence TS (frequency calculation processing f2).
The frequent sequence listing device 20 operates by repeating this process (f2 to f6) until the stack in the control unit 25 becomes empty.
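The overall loop f1 to f6 can be put together in a compact sketch. Several simplifications are assumed here and are not taken from the source: the suffix array is built by naive sorting, the frequency step uses a linear scan for brevity, each stack entry carries its own target suffix list instead of using a separate storage unit, and the rightward walk uses INV[SA[n] + 1] directly, which visits the same positions as going through P_INV. All names are hypothetical.

```python
from itertools import groupby


def enumerate_frequent(sequences, min_freq, delim="%", end="$"):
    """Return (pattern, frequency) pairs, where frequency is the number of
    distinct positions at which the pattern's last character can occur."""
    T = delim.join(sequences) + delim + end              # f1: build T, SA, INV
    SA = sorted(range(len(T)), key=lambda i: T[i:])      # naive construction
    INV = [0] * len(T)
    for rank, pos in enumerate(SA):
        INV[pos] = rank

    def frequencies(P):                                  # f2 (linear variant)
        runs, s = [], 0
        for c, group in groupby(P, key=lambda r: T[SA[r]]):
            n = len(list(group))
            if c not in (delim, end):
                runs.append((c, n, s, s + n))
            s += n
        return runs

    def mapped(P, s, e):                                 # f6: SA mapping
        out = set()
        for i in range(s, e):
            n = INV[SA[P[i]] + 1]
            while T[SA[n]] != delim:
                out.add(n)
                n = INV[SA[n] + 1]
        return sorted(out)

    results = []
    stack = [("", list(range(len(T))))]                  # first run: all suffixes
    while stack:                                         # f5: stop when empty
        TS, P = stack.pop()
        for c, n, s, e in frequencies(P):
            if n >= min_freq:                            # f3: threshold check
                results.append((TS + c, n))              # f4: output and push
                stack.append((TS + c, mapped(P, s, e)))
    return results
```

For the two sequences "AB" and "AB" with a threshold of 2, the sketch reports A, B, and AB, each with frequency 2.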
Each unit of the frequent sequence listing device 20 may be realized by a combination of hardware and software. In that form, an enumeration program is loaded into RAM (random access memory), and hardware such as a control unit (CPU) operates based on the program, thereby realizing each unit as the corresponding means. The program may also be recorded on a recording medium and distributed. The program recorded on the recording medium is read into memory via a wired or wireless connection or via the recording medium itself, and operates the control unit and the like. Examples of the recording medium include an optical disk, a magnetic disk, a semiconductor memory device, and a hard disk.
Put another way, the above embodiment can be realized by causing an information processing device (computer) that operates as the frequent sequence enumeration device 20 to operate, based on an enumeration program loaded into RAM, as the SA creation unit 21, the SA storage unit 22, the SA frequency calculation unit 23, the SA mapping unit 24, and the control unit 25.
Next, the effect of the frequent sequence listing device 20 of the first embodiment will be described.
In the frequent sequence enumeration apparatus 20 of the first embodiment, DB mapping and frequency calculation are performed using a suffix array and its inverse array. With a suffix array, the list of suffixes to the right of a given character can be extracted efficiently, so data need not be copied repeatedly. Moreover, because the suffixes are always arranged in dictionary order, the frequency can be calculated efficiently using this property.
[Second Embodiment]
The enumeration apparatus 20 according to the first embodiment uses a suffix array and its inverse array, and moves one character to the right by referring to the two arrays as INV[SA[i] + 1]. A prefix array can be used in a similar way. Hereinafter, the prefix array may be abbreviated as PA.
The "prefix array" is obtained by sorting all the prefixes of a given character string in dictionary order. In this case, once the prefix array PA and its inverse array INV have been created, the dictionary position of the character one to the right can be obtained by referring to INV[PA[i] - 1].
Referring to FIG. 10, a frequent sequence listing device 20A according to the second embodiment of the present invention includes a PA creation unit 21A, a PA storage unit 22A, a PA frequency calculation unit 23A, a PA mapping unit 24A, and a control unit 25A.
The PA creation unit 21A receives a series data set as input and generates a prefix array, the inverse array of the prefix array, and the original series data while inserting series data delimiters.
The PA storage unit 22A stores the prefix array created by the PA creation unit 21A, the inverse array of the prefix array, and the original series data.
The PA frequency calculation unit 23A refers to the data in the PA storage unit 22A, calculates frequencies using the fact that the data is arranged in dictionary order, and inputs the result to the control unit 25A.
The PA mapping unit 24A calculates the prefix pointer to be referred to next, based on the pointer designated by the control unit 25A.
The PA storage unit 22A stores a prefix array that, when all the prefixes of the series data are arranged in dictionary order, makes it possible to refer to a position on the series data from a position in the dictionary.
The PA frequency calculation unit 23A receives a set of positions on the prefix array as input and counts the number of characters included in the prefixes referenced by the prefix array, using the fact that the prefixes are ordered.
The PA mapping unit 24A takes a set of positions in the prefix array and a specific character as input, subtracts 1 from the values in the prefix array, and refers to the inverse array of the prefix array, thereby returning the positions on the prefix array for the input character.
Further, when the PA frequency calculation unit 23A takes the prefix array and a set of positions on the prefix array as input and counts the number of characters existing in the original series data by using the fact that the prefix array is ordered, it switches whether to perform a binary search depending on the frequency.
The operations of the PA creation unit 21A, the PA storage unit 22A, the PA frequency calculation unit 23A, the PA mapping unit 24A, and the control unit 25A are the same as those of the SA creation unit 21, the SA storage unit 22, the SA frequency calculation unit 23, the SA mapping unit 24, and the control unit 25 in the first embodiment, respectively, so their detailed description is omitted.
Note that each unit of the frequent sequence listing device 20A may be realized by a combination of hardware and software. In that form, an enumeration program is loaded into RAM (random access memory), and hardware such as a control unit (CPU) operates based on the program, thereby realizing each unit as the corresponding means. The program may also be recorded on a recording medium and distributed. The program recorded on the recording medium is read into memory via a wired or wireless connection or via the recording medium itself, and operates the control unit and the like. Examples of the recording medium include an optical disk, a magnetic disk, a semiconductor memory device, and a hard disk.
Put another way, the above embodiment can be realized by causing an information processing device (computer) that operates as the frequent sequence enumeration device 20A to operate, based on an enumeration program loaded into RAM, as the PA creation unit 21A, the PA storage unit 22A, the PA frequency calculation unit 23A, the PA mapping unit 24A, and the control unit 25A.
Next, the effect of the frequent sequence listing apparatus 20A of the second embodiment will be described.
In the frequent sequence enumeration apparatus 20A of the second embodiment, DB mapping and frequency calculation are performed using a prefix array and its inverse array. With a prefix array, the list of prefixes to the right of a given character can be extracted efficiently, so data need not be copied repeatedly. Moreover, because the prefixes are always arranged in dictionary order, the frequency can be calculated efficiently using this property.
The “suffix array” and “prefix array” are collectively referred to as “sorted array”.
Accordingly, although not illustrated, a frequent sequence enumeration apparatus that expresses the first and second embodiments as a generic concept consists of a sorted array creation unit, a sorted array storage unit, a sorted array frequency calculation unit, a sorted array mapping unit, and a control unit.
The sorted array storage unit stores a sorted array that, when all suffixes or all prefixes of the series data are arranged in dictionary order, makes it possible to refer to a position on the original series data from a position in the dictionary.
The sorted array frequency calculation unit receives the sorted array and a set of positions on the sorted array as input, and counts the number of characters existing in the original series data using the fact that the sorted array is ordered.
The sorted array mapping unit takes a set of positions in the sorted array and a specific character as input, adds to or subtracts from the values in the sorted array, refers to the appearance positions of the suffixes or prefixes adjacent to the input character, and narrows down the set of target positions.
In addition, when the sorted array frequency calculation unit takes the sorted array and a set of positions on the sorted array as input and counts the number of characters existing in the original series data by using the fact that the sorted array is ordered, it switches whether to perform a binary search depending on the frequency.
Next, an example of the frequent sequence enumeration apparatus 10 according to the first embodiment shown in FIG. 2 will be described.
Here, assume a frequency threshold of 2 and that the following four sequences are given as the sequence database.
CAABCC
ABCB
CABC
ABBCA
At this time, the SA creation unit 21 inserts the delimiter character % and the terminal character $ to create the character string "CAABCC%ABCB%CABC%ABBCA%$", and creates the suffix array of FIG. 11 (SA creation processing f1 of FIG. 5).
In FIG. 11, each suffix is shown for ease of understanding, but in an actual implementation the suffixes themselves are not stored.
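The SA creation processing f1 for this example can be sketched as below. The function name is hypothetical, the suffix array is built by naive sorting rather than by a dedicated construction algorithm, and the ordering of "$" and "%" follows Python's character ordering, which may differ from the ordering used in FIG. 11. Joining the four sequences as listed gives a string of 24 characters, which agrees with the target suffix list running from 0 to 23.

```python
def build_index(sequences, delim="%", end="$"):
    """Return (T, SA, INV): the concatenated string, its suffix array, and
    the inverse array with INV[SA[r]] == r for every rank r."""
    T = delim.join(sequences) + delim + end
    # Naive construction for illustration: sort start positions by suffix text.
    SA = sorted(range(len(T)), key=lambda i: T[i:])
    INV = [0] * len(T)
    for rank, pos in enumerate(SA):
        INV[pos] = rank
    return T, SA, INV


T, SA, INV = build_index(["CAABCC", "ABCB", "CABC", "ABBCA"])
```

Only T, SA, and INV need to be kept; the suffix strings used as sort keys are temporary.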
When the suffix array SA of FIG. 11 is obtained, the SA frequency calculation unit 23 performs a binary search process and calculates the frequency of each character (frequency calculation process f2 of FIG. 5).
In this case, the target suffix list P is 0 to 23 and the target sequence TS is the empty string "".
The result is shown in FIG. 12, which lists, for each of the characters A, B, and C, its frequency and its start and end positions on the whole SA.
The SA frequency calculation unit 23 outputs the target suffix list P (0 to 23), the target sequence TS (the empty string ""), and the data of FIG. 12 to the control unit 25.
Next, the control unit 25 examines the data of FIG. 12, outputs the characters A, B, and C (output determination processing f3 in FIG. 5), and pushes "C", "B", and "A" onto the stack (step f4 in FIG. 5).
In this state, the control unit 25 then takes out A from the top of the stack (step f5 in FIG. 5) and inputs the target suffix list P (0 to 23) and the sequence (A) to the SA mapping unit 24.
The SA mapping unit 24 performs SA mapping based on these (SA mapping processing f6 in FIG. 5).
In this process, first, the SA mapping unit 24 creates an inverse array for P (step S11 in FIG. 8).
Note that P_INV[i] = P[i], because currently P[i] = i.
The SA mapping unit 24 then performs the suffix array expansion process (step S12 in FIG. 8). In this process, the SA mapping unit 24 creates the list of suffixes beginning with A in the current P.
Accordingly, the SA mapping unit 24 expands rightward from the 6th to 12th positions in FIG. 11.
For example, the next position n for i = 6 is
n = INV[SA[P[6]] + 1] = INV[SA[6] + 1] = INV[22] = 1.
The suffix one to the right of i = 6 is at position 1, and T[SA[1]] = '%', so no further processing is performed.
Next, an example for i = 7 will be described.
The next position n for i = 7 is
n = INV[SA[P[7]] + 1] = INV[SA[7] + 1] = INV[3] = 10.
Here T[SA[10]] = A, so the SA mapping unit 24 outputs
k = P_INV[10] = 10.
By repeating the same process, the list
{6, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 23}
is output to the SA storage unit 22 as the target suffix list based on the target sequence A.
FIG. 13 shows a pseudo partial suffix array in which only these rows are extracted.
Next, the SA frequency calculation unit 23 performs frequency calculation based on these data (frequency calculation processing f2 in FIG. 5).
The result is shown in FIG. 14. FIG. 14 shows that, as characters following the target sequence A, A appears 2 times, B 6 times, and C 6 times. The control unit 25 therefore concatenates each of these with the target sequence A and outputs AA, AB, and AC.
Further, the control unit 25 adds these to the stack {C, B}, making the stack {C, B, AC, AB, AA} (output determination processing f3 in FIG. 5).
Then, the control unit 25 takes out the topmost sequence AA and performs the same processing again.
That is, the next SA mapping yields the target suffix list of FIG. 15.
However, this list contains B only once and C only once, so the control unit 25 outputs nothing and proceeds to the next sequence in the stack, AB.
In this case, the SA mapping yields FIG. 16, and the output is ABB with 3 occurrences and ABC with 4 occurrences.
The enumeration apparatus 20 thus performs this recursive computation using a suffix array rather than a DB mapping. The computation time of the mapping itself is proportional to the data length, as in the conventional approach, but the frequency calculation after mapping can be performed by binary search, so the apparatus operates efficiently.
While the present invention has been described with reference to the embodiments, the present invention is not limited to the above embodiments. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.
A part or all of the above-described embodiment can be described as in the following supplementary notes, but is not limited thereto.
(Supplementary note 1) A frequent sequence enumeration apparatus comprising:
a sorted array storage unit that stores a sorted array which, when all suffixes or all prefixes of series data are arranged in dictionary order, makes it possible to refer to a position on the original series data from a position in the dictionary;
a sorted array frequency calculation unit that takes the sorted array and a set of positions on the sorted array as input and counts the number of characters existing in the original series data by using the fact that the sorted array is ordered; and
a sorted array mapping unit that takes a set of positions in the sorted array and a specific character as input, adds to or subtracts from values in the sorted array, refers to the appearance positions of the suffixes or prefixes adjacent to the input character, and narrows down the set of target positions.
(Supplementary note 2) The frequent sequence enumeration apparatus according to supplementary note 1, wherein the sorted array frequency calculation unit, when taking the sorted array and a set of positions on the sorted array as input and counting the number of characters existing in the original series data by using the fact that the sorted array is ordered, switches whether to perform a binary search depending on the frequency.
(Supplementary note 3) The frequent sequence enumeration apparatus according to supplementary note 1 or 2, wherein the sorted array is a suffix array,
the apparatus further comprising a suffix array creation unit that takes a series data set as input and generates the suffix array, the inverse array of the suffix array, and the original series data while inserting series data delimiters.
(Supplementary note 4) The frequent sequence enumeration apparatus according to supplementary note 3, wherein the sorted array storage unit is a suffix array storage unit that stores the suffix array, the inverse array of the suffix array, and the original series data generated by the suffix array creation unit.
(Supplementary note 5) The frequent sequence enumeration apparatus according to supplementary note 1 or 2, wherein the sorted array is a prefix array,
the apparatus further comprising a prefix array creation unit that takes a series data set as input and generates the prefix array, the inverse array of the prefix array, and the original series data while inserting series data delimiters.
(Supplementary note 6) The frequent sequence enumeration apparatus according to supplementary note 5, wherein the sorted array storage unit is a prefix array storage unit that stores the prefix array, the inverse array of the prefix array, and the original series data generated by the prefix array creation unit.
(Supplementary note 7) A method of enumerating high-frequency sequences from within a data string using an enumeration device, the method comprising:
a storing step of storing, in a sorted array storage unit, a sorted array which, when all suffixes or all prefixes of series data are arranged in dictionary order, makes it possible to refer to a position on the original series data from a position in the dictionary;
a calculation step of taking the sorted array and a set of positions on the sorted array as input and counting the number of characters existing in the original series data by using the fact that the sorted array is ordered; and
a mapping step of taking a set of positions in the sorted array and a specific character as input, adding to or subtracting from values in the sorted array, referring to the appearance positions of the suffixes or prefixes adjacent to the input character, and narrowing down the set of target positions.
(Supplementary note 8) The frequent sequence enumeration method according to supplementary note 7, wherein the calculation step, when taking the sorted array and a set of positions on the sorted array as input and counting the number of characters existing in the original series data by using the fact that the sorted array is ordered, switches whether to perform a binary search depending on the frequency.
(Supplementary note 9) The frequent sequence enumeration method according to supplementary note 7 or 8, wherein the sorted array is a suffix array,
the method further comprising a creation step of taking a series data set as input and generating the suffix array, the inverse array of the suffix array, and the original series data while inserting series data delimiters.
(Supplementary note 10) The frequent sequence enumeration method according to supplementary note 9, wherein the storing step stores the suffix array, the inverse array of the suffix array, and the original series data generated in the creation step in a suffix array storage unit serving as the sorted array storage unit.
(Supplementary note 11) The frequent sequence enumeration method according to supplementary note 7 or 8, wherein the sorted array is a prefix array,
the method further comprising a creation step of taking a series data set as input and generating the prefix array, the inverse array of the prefix array, and the original series data while inserting series data delimiters.
(Supplementary note 12) The frequent sequence enumeration method according to supplementary note 11, wherein the storing step stores the prefix array, the inverse array of the prefix array, and the original series data generated in the creation step in a prefix array storage unit serving as the sorted array storage unit.
(Supplementary note 13) A computer-readable recording medium on which is recorded a program for causing a computer to enumerate high-frequency sequences from within a data string, the program causing the computer to execute:
a storing procedure of storing, in a sorted array storage unit, a sorted array which, when all suffixes or all prefixes of series data are arranged in dictionary order, makes it possible to refer to a position on the original series data from a position in the dictionary;
a calculation procedure of taking the sorted array and a set of positions on the sorted array as input and counting the number of characters existing in the original series data by using the fact that the sorted array is ordered; and
a mapping procedure of taking a set of positions in the sorted array and a specific character as input, adding to or subtracting from values in the sorted array, referring to the appearance positions of the suffixes or prefixes adjacent to the input character, and narrowing down the set of target positions.
(Supplementary note 14) The computer-readable recording medium according to supplementary note 13, wherein the calculation procedure causes the computer, when taking the sorted array and a set of positions on the sorted array as input and counting the number of characters existing in the original series data by using the fact that the sorted array is ordered, to switch whether to perform a binary search depending on the frequency.
(Supplementary note 15) The computer-readable recording medium according to supplementary note 13 or 14, wherein the sorted array is a suffix array, and
the program further causes the computer to execute a creation procedure of taking a series data set as input and generating the suffix array, the inverse array of the suffix array, and the original series data while inserting series data delimiters.
(Supplementary note 16) The computer-readable recording medium according to supplementary note 15, wherein the storing procedure causes the computer to store the suffix array, the inverse array of the suffix array, and the original series data generated in the creation procedure in a suffix array storage unit serving as the sorted array storage unit.
(Supplementary note 17) The computer-readable recording medium according to supplementary note 13 or 14, wherein the sorted array is a prefix array, and
the program further causes the computer to execute a creation procedure of taking a series data set as input and generating the prefix array, the inverse array of the prefix array, and the original series data while inserting series data delimiters.
(Supplementary note 18) The computer-readable recording medium according to supplementary note 17, wherein the storing procedure causes the computer to store the prefix array, the inverse array of the prefix array, and the original series data generated in the creation procedure in a prefix array storage unit serving as the sorted array storage unit.
The present invention can be used to rapidly compute characteristic patterns that appear frequently in purchase-log analysis, DNA analysis, text-data analysis, and log data.
20, 20A Frequent sequence enumeration device
21 SA creation unit
21A PA creation unit
22 SA storage unit (sorted array storage unit)
22A PA storage unit (sorted array storage unit)
23 SA frequency calculation unit (sorted array frequency calculation unit)
23A PA frequency calculation unit (sorted array frequency calculation unit)
24 SA mapping unit (sorted array mapping unit)
24A PA mapping unit (sorted array mapping unit)
25, 25A Control unit

This application claims priority based on Japanese Patent Application No. 2013-173134, filed on August 23, 2013, the entire disclosure of which is incorporated herein.

Claims (10)

  1.  A frequent sequence enumeration device comprising:
    a sorted array storage unit that stores a sorted array which, when all suffixes or all prefixes of series data are arranged in dictionary order, makes a position in the original series data referable from a position in the dictionary;
    a sorted array frequency calculation unit that takes the sorted array and a set of positions on the sorted array as input and counts the number of characters existing in the original series data by exploiting the fact that the sorted array is ordered; and
    a sorted array mapping unit that takes a set of positions in the sorted array and a specific character as input, performs addition or subtraction on the values in the sorted array, refers to the appearance positions of suffixes or prefixes adjacent to the input character, and narrows down the set of target positions.
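The mapping operation of claim 1 can be illustrated with a minimal Python sketch. It assumes the sorted array is a suffix array `sa` with a reverse (inverse) array `isa`, and it reads "addition or subtraction" as stepping one position to the left in the original series data. The names `build` and `map_left`, the naive construction, and the `$` delimiter are hypothetical choices for illustration and are not taken from the claims.

```python
def build(text):
    """Naively build a suffix array and its inverse for a short text."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    isa = [0] * len(text)
    for rank, pos in enumerate(sa):
        isa[pos] = rank  # isa[pos] is the rank of the suffix starting at pos
    return sa, isa


def map_left(text, sa, isa, ranks, ch):
    """Keep only suffixes preceded by `ch`; return the ranks of the
    suffixes that start one position earlier (i.e. with `ch` prepended)."""
    narrowed = set()
    for r in ranks:
        prev = sa[r] - 1  # subtraction on the value stored in the sorted array
        if prev >= 0 and text[prev] == ch:
            narrowed.add(isa[prev])  # back to a position on the sorted array
    return narrowed


text = "ab$b$"
sa, isa = build(text)
# Ranks 3 and 4 are the two suffixes that start with "b".
print(map_left(text, sa, isa, {3, 4}, "a"))  # {2}
```

In this example only one of the two occurrences of "b" is preceded by "a", so the set of positions shrinks from two entries to one.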
  2.  The frequent sequence enumeration device according to claim 1, wherein the sorted array frequency calculation unit, when taking the sorted array and a set of positions on the sorted array as input and counting the number of characters existing in the original series data by exploiting the fact that the sorted array is ordered, switches whether or not to perform a binary search depending on the frequency.
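One plausible reading of the frequency-dependent switch in claim 2 is sketched below in Python. It assumes that the positions are given as an ascending list of ranks on a suffix array, so that the first characters of the referenced suffixes are non-decreasing. Small sets are then counted by a linear scan, and larger sets by a binary search for the end of each run of equal characters. The function names and the `threshold` parameter are hypothetical and do not appear in the claims.

```python
def run_end(text, sa, ranks, start):
    """Binary search for the first index whose character is greater than
    the character at index `start` (characters are non-decreasing)."""
    c = text[sa[ranks[start]]]
    lo, hi = start, len(ranks)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[ranks[mid]]] <= c:
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_chars(text, sa, ranks, threshold=8):
    """Count the first character of each suffix referenced by `ranks`."""
    counts = {}
    if len(ranks) < threshold:
        # Low frequency: a plain scan is assumed to be cheaper.
        for r in ranks:
            c = text[sa[r]]
            counts[c] = counts.get(c, 0) + 1
        return counts
    # High frequency: jump over each run of equal characters.
    i = 0
    while i < len(ranks):
        j = run_end(text, sa, ranks, i)
        counts[text[sa[ranks[i]]]] = j - i
        i = j
    return counts


text = "ab$b$"
sa = sorted(range(len(text)), key=lambda i: text[i:])
ranks = list(range(len(text)))
print(count_chars(text, sa, ranks, threshold=1))
```

Both branches return the same counts; the threshold only decides which counting strategy is used.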
  3.  The frequent sequence enumeration device according to claim 1 or 2, wherein the sorted array is a suffix array, and
    the device further comprises a suffix array creation unit that takes a set of series data as input and generates the suffix array, the reverse array of the suffix array, and the original series data while inserting delimiters between the series data.
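The creation step of claim 3 can be sketched as follows in Python. The sketch assumes each series is a string, that a single `$` character serves as the delimiter, and that a naive sort is acceptable for illustration; a practical implementation would likely use a faster construction algorithm. The function name `build_suffix_array` is hypothetical, and the reverse array is what is commonly called the inverse suffix array.

```python
def build_suffix_array(sequences, delimiter="$"):
    """Return (text, sa, isa) for a set of series.

    text: the series concatenated with a delimiter after each one
    sa:   start positions of all suffixes of text, in dictionary order
    isa:  reverse array, with isa[sa[r]] == r for every rank r
    """
    text = "".join(seq + delimiter for seq in sequences)
    # Naive construction: sort start positions by the suffix they begin.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    isa = [0] * len(text)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    return text, sa, isa


text, sa, isa = build_suffix_array(["ab", "b"])
print(text)  # ab$b$
print(sa)    # [4, 2, 0, 3, 1]
print(isa)   # [2, 4, 1, 3, 0]
```

Storing `isa` alongside `sa` makes it possible to move from a position in the original series data back to its rank in the sorted array in constant time.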
  4.  The frequent sequence enumeration device according to claim 3, wherein the sorted array storage unit is a suffix array storage unit that stores the suffix array, the reverse array of the suffix array, and the original series data generated by the suffix array creation unit.
  5.  The frequent sequence enumeration device according to claim 1 or 2, wherein the sorted array is a prefix array, and
    the device further comprises a prefix array creation unit that takes a set of series data as input and generates the prefix array, the reverse array of the prefix array, and the original series data while inserting delimiters between the series data.
  6.  The frequent sequence enumeration device according to claim 5, wherein the sorted array storage unit is a prefix array storage unit that stores the prefix array, the reverse array of the prefix array, and the original series data generated by the prefix array creation unit.
  7.  A frequent sequence enumeration method for enumerating high-frequency sequences from a data string using an enumeration device, the method comprising:
    a storage step of storing, in a sorted array storage unit, a sorted array which, when all suffixes or all prefixes of series data are arranged in dictionary order, makes a position in the original series data referable from a position in the dictionary;
    a calculation step of taking the sorted array and a set of positions on the sorted array as input and counting the number of characters existing in the original series data by exploiting the fact that the sorted array is ordered; and
    a mapping step of taking a set of positions in the sorted array and a specific character as input, performing addition or subtraction on the values in the sorted array, referring to the appearance positions of suffixes or prefixes adjacent to the input character, and narrowing down the set of target positions.
  8.  The frequent sequence enumeration method according to claim 7, wherein the calculation step, when taking the sorted array and a set of positions on the sorted array as input and counting the number of characters existing in the original series data by exploiting the fact that the sorted array is ordered, switches whether or not to perform a binary search depending on the frequency.
  9.  A computer-readable recording medium recording a program that causes a computer to enumerate high-frequency sequences from a data string, the program causing the computer to execute:
    a storage procedure of storing, in a sorted array storage unit, a sorted array which, when all suffixes or all prefixes of series data are arranged in dictionary order, makes a position in the original series data referable from a position in the dictionary;
    a calculation procedure of taking the sorted array and a set of positions on the sorted array as input and counting the number of characters existing in the original series data by exploiting the fact that the sorted array is ordered; and
    a mapping procedure of taking a set of positions in the sorted array and a specific character as input, performing addition or subtraction on the values in the sorted array, referring to the appearance positions of suffixes or prefixes adjacent to the input character, and narrowing down the set of target positions.
  10.  The computer-readable recording medium according to claim 9, wherein the calculation procedure causes the computer, when taking the sorted array and a set of positions on the sorted array as input and counting the number of characters existing in the original series data by exploiting the fact that the sorted array is ordered, to switch whether or not to perform a binary search depending on the frequency.
PCT/JP2014/071136 2013-08-23 2014-08-05 Frequent sequence enumeration device, method, and recording medium WO2015025751A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2013173134 2013-08-23
JP2013-173134 2013-08-23

Publications (1)

Publication Number Publication Date
WO2015025751A1 (en) 2015-02-26

Family

ID=52483528

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2014/071136 WO2015025751A1 (en) 2013-08-23 2014-08-05 Frequent sequence enumeration device, method, and recording medium

Country Status (1)

Country Link
WO (1) WO2015025751A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001067378A (en) * 1999-06-23 2001-03-16 Sumitomo Electric Ind Ltd Calculation method and device for similarity of character string and recording medium
JP2002229987A (en) * 2001-01-11 2002-08-16 Internatl Business Mach Corp <Ibm> Method for pattern-search, apparatus thereof, computer program and record medium
JP2003228571A (en) * 2001-11-28 2003-08-15 Kyoji Umemura Method of counting appearance frequency of character string, and device for using the method
JP2004272639A (en) * 2003-03-10 2004-09-30 Nippon Telegr & Teleph Corp <Ntt> Word extraction method, device and program
JP2013061745A (en) * 2011-09-12 2013-04-04 Nippon Telegr & Teleph Corp <Ntt> Representative document selection device, method, and program, and computer-readable recording medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AKIKO AIZAWA: "Ta-class Bunsho Bunrui ni Okeru Ziv-Merhav Crossparsing no Tekiyo to Hyoka", TRANSACTIONS OF INFORMATION PROCESSING SOCIETY OF JAPAN RONBUNSHI JOURNAL, vol. 52, no. 10, 31 October 2011 (2011-10-31), pages 2953 - 2964 *
NAOKI TANIDA ET AL.: "Hardware Acceleration of Full-Text Search Using Succinct Data Structure", IEICE TECHNICAL REPORT, vol. 108, no. 180, 29 July 2008 (2008-07-29), pages 7 - 12 *
NAOYA ITO, RECENT PERL WORLD DAI 18 KAI SUFFIX ARRAY (SETSUBIJI HAIRETSU) ALGORITHM & DATA KOZO SERIES 1, vol. 50, no. 1ST ED, 15 May 2009 (2009-05-15), pages 185 - 193 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14838706

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14838706

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP