KR102146625B1

KR102146625B1 - Apparatus and method for computing incrementally infix probabilities based on automata

Info

Publication number: KR102146625B1
Application number: KR1020190012472A
Authority: KR
Inventors: 한요섭; 마르코 코그네타; 권순찬; 박준우
Original assignee: 연세대학교 산학협력단
Priority date: 2019-01-31
Filing date: 2019-01-31
Publication date: 2020-08-20
Also published as: KR20200094977A

Abstract

The present invention provides an automata-based apparatus and method for incrementally computing infix probabilities. A string is obtained as a regular expression using a DFA model built on a deterministic finite automaton, and the regular expression is then converted into an unambiguous regular expression. A probabilistic finite automaton is used to compute accurately the occurrence probability of the string represented by the unambiguous regular expression. The occurrence probability for each increment of the string is obtained by extending the regular expression incrementally, so that it can be obtained easily.

Description

Apparatus and method for incrementally computing infix probabilities based on automata {APPARATUS AND METHOD FOR COMPUTING INCREMENTALLY INFIX PROBABILITIES BASED ON AUTOMATA}

The present invention relates to an apparatus and method for computing infix probabilities, and more particularly to an automata-based apparatus and method for computing infix probabilities incrementally.

One of the main tasks in natural language processing is to compute the occurrence probability of a given phrase, or of all phrases that match a given pattern.

In general, probabilistic context-free grammars or probabilistic finite automata (hereinafter PFA) are the models most often used to represent, as a language model, the distribution of strings that match a given pattern. Of these, PFAs can express many probabilistic language phenomena in a representation that is simple yet powerful and well understood, and most current speech processing tasks therefore rely on PFAs.

An important problem for PFAs is computing affix probabilities with respect to a given pattern distribution. That is, given a PFA and a string w, the task is to compute the probability that w appears at various positions in the distribution of strings modeled by the PFA. For example, in that distribution w may be placed before another arbitrary character or string x and appear as a prefix, as in wx, or be placed after it and appear as a suffix, as in xw. It may also be placed between arbitrary characters or strings x and y and appear in the middle, as in xwy (an infix). The sum of the probabilities over all such cases must then be computed.

The probability that a given string w appears as a prefix is relatively easy to compute, whereas the probability that it appears as a suffix or an infix is harder to compute because w may occur repeatedly. Nevertheless, methods for computing the probability that w appears as a suffix or an infix, as well as a prefix, have already been studied extensively and are known.

Existing work, however, has a limitation: it can compute the probability that a given string w occurs, but it cannot carry that computation over to an increment (also called an extension) of w. For example, the probability that w occurs in the distribution of strings modeled by the PFA can be computed, but the probability that the string wa, obtained by appending a character a, occurs must be recomputed separately. Likewise, even after the probability for wa has been computed, the probability for w must be computed separately.

Because the probability for each increment of the string must therefore be computed separately, the computation takes a very long time and consumes a large amount of resources.

Korean Registered Patent No. 10-1645890 (registered July 29, 2016)

An object of the present invention is to provide an automata-based incremental infix probability calculation apparatus and method that can compute the probability for each increment of a string accurately and quickly, so that incremental infix probabilities are easily obtained.

Another object of the present invention is to provide an automata-based incremental infix probability calculation apparatus and method that computes the incremental infix probabilities of a string simultaneously and can therefore compute the probability for each increment of the string at high speed.

To achieve the above object, an automata-based incremental infix probability calculation apparatus according to an embodiment of the present invention includes: a DFA model acquisition unit that acquires a DFA model, composed of a plurality of states and a transition function, in order to obtain a regular language in regular-expression form, on the basis of a deterministic finite automaton (hereinafter DFA), for each character of a string w whose occurrence probability is to be computed; a regular expression conversion unit that adds to the DFA model an initial state, a single final state, and the transitions corresponding to them, and then dynamically and repeatedly removes paths that may overlap among the paths between the states of the DFA model, thereby converting it into a DFA model that can express an unambiguous regular expression; a PFA model intersection unit that computes the probability of each state and transition path of the converted DFA model on the basis of a PFA model given as a probabilistic finite automaton (hereinafter PFA) and applies it as a weight of the converted DFA model; and an incremental probability calculation unit that obtains an incrementally weighted DFA model by adding to the weighted DFA model the states and transitions for the strings F(w) \ F(wa), defined from the set F(w) of strings containing w and the set F(wa) of strings in which the incremented string wa occurs, and that computes the occurrence probabilities of the string and of the incremented string using the incrementally weighted DFA model.

To achieve the above object, an automata-based incremental infix probability calculation method according to another embodiment of the present invention includes: acquiring a DFA model, composed of a plurality of states and a transition function, in order to obtain a regular language in regular-expression form, on the basis of a deterministic finite automaton (hereinafter DFA), for each character of a string w whose occurrence probability is to be computed; adding to the DFA model an initial state, a single final state, and the transitions corresponding to them, and then dynamically and repeatedly removing paths that may overlap among the paths between the states of the DFA model, thereby converting it into a DFA model that can express an unambiguous regular expression; computing the probability of each state and transition path of the converted DFA model on the basis of a PFA model given as a probabilistic finite automaton (hereinafter PFA) and applying it as a weight of the converted DFA model; and obtaining an incrementally weighted DFA model by adding to the weighted DFA model the states and transitions for the strings F(w) \ F(wa), defined from the set F(w) of strings containing w and the set F(wa) of strings in which the incremented string wa occurs, and computing the occurrence probabilities of the string and of the incremented string using the incrementally weighted DFA model.

Accordingly, the automata-based incremental infix probability calculation apparatus and method according to embodiments of the present invention obtain a string as a regular expression using a DFA model built on a deterministic finite automaton, convert it into an unambiguous regular expression, and use a probabilistic finite automaton to compute accurately the occurrence probability of the string represented by the unambiguous regular expression. In addition, the occurrence probability for each increment of the string is obtained by extending the regular expression incrementally, so that it can be obtained easily.

FIG. 1 shows a schematic structure of an automata-based incremental infix probability calculation apparatus according to an embodiment of the present invention.
FIG. 2 shows an example in which the regular expression conversion unit of FIG. 1 adds states and transitions to a DFA model.
FIG. 3 shows the algorithm for intersecting a DFA model with a PFA model.
FIG. 4 shows the relationship between the regular expressions for a string and for its increment.
FIG. 5 shows the algorithm by which the incremental probability calculation unit of FIG. 1 cumulatively computes, from the DFA model, the probabilities that a string and its incremented string occur.
FIG. 6 shows measured computation times for the incremental method and the intersection method as a function of string length and state-set size.
FIG. 7 shows an automata-based incremental infix probability calculation method according to an embodiment of the present invention.

To fully understand the present invention, its operational advantages, and the objects achieved by practicing it, reference should be made to the accompanying drawings, which illustrate preferred embodiments of the present invention, and to the content described in those drawings.

Hereinafter, the present invention is described in detail by describing preferred embodiments with reference to the accompanying drawings. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described. Parts unrelated to the description are omitted for clarity, and like reference numerals in the drawings denote like members.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components, not that other components are excluded, unless stated otherwise. Terms such as "... unit", "... device", "module", and "block" used in the specification denote a unit that processes at least one function or operation, and such a unit may be implemented in hardware, in software, or in a combination of hardware and software.

FIG. 1 shows a schematic structure of an automata-based incremental infix probability calculation apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the automata-based incremental infix probability calculation apparatus according to this embodiment includes a string acquisition unit 110, a DFA model acquisition unit 120, a regular expression conversion unit 130, a matrix conversion unit 140, a PFA model intersection unit 150, and an incremental probability calculation unit 160.

First, the string acquisition unit 110 acquires the string w whose occurrence probability is to be computed. The string acquisition unit 110 may also acquire the language (sentences) that is to be searched for occurrences of w. In what follows, the set of characters is denoted Σ and the set of all strings Σ*, and the string to be searched for is an element of Σ*: w = w₁w₂...w_n ∈ Σ*. The length of w is |w| = n, and the empty string is denoted λ.

The DFA model acquisition unit 120 acquires a DFA model D, modeled as a deterministic finite automaton (hereinafter DFA), which has a unique state change for each character of the string w.

The DFA model D is a model for converting the string w to be searched for into a regular language L(D), and it is generally modeled as a 5-tuple (Q, Σ, δ, q₁, F). Here Q is a finite set of states (q₁, q₂, ..., q_n ∈ Q), and δ is the transition function δ: Q × Σ → Q, which maps each state and character to a state in Q. q₁ ∈ Q is the initial state, and F ⊆ Q is the set of final states, which contains at least one state.
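As a minimal sketch of such a 5-tuple, the Python code below builds a DFA that accepts every string containing w. The construction (a matching automaton in the style of Knuth-Morris-Pratt), the function names build_infix_dfa and accepts, and the 0-based state numbering are assumptions made for illustration; the text does not specify how D is built.

```python
def build_infix_dfa(w, alphabet):
    """Return (Q, Sigma, delta, q_init, F) accepting strings that contain w.

    State k means "the longest suffix of the input read so far that is
    also a prefix of w has length k"; state len(w) is the only final state.
    """
    n = len(w)
    delta = {}
    for state in range(n + 1):
        for c in alphabet:
            if state == n:
                delta[(state, c)] = n  # w already seen: stay accepting
                continue
            seen = w[:state] + c
            k = len(seen)
            # shrink k until the last k characters read form a prefix of w
            while k > 0 and seen[-k:] != w[:k]:
                k -= 1
            delta[(state, c)] = k
    return set(range(n + 1)), set(alphabet), delta, 0, {n}


def accepts(dfa, s):
    """Run the DFA on s; every step follows exactly one transition."""
    _, _, delta, q, finals = dfa
    for c in s:
        q = delta[(q, c)]
    return q in finals
```

For w = "aba" over {a, b} this yields four states, and accepts(dfa, s) agrees with the plain substring test "aba" in s.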

Since a language is a set of strings, the DFA model D, modeled in advance as a DFA, can recognize the acquired language as a regular language L(D). The DFA model D can obtain and recognize the regular language L(D) in the form of a regular expression.

A regular expression is the usual way of writing a regular language. It uses a set of predefined metacharacters to describe the regular language L(D) compactly. For example, using the metacharacter "*", which denotes zero or more occurrences, the regular language consisting of a, aa, aaa, ... (together with the empty string) can be written compactly as a*. Likewise, "+" is a metacharacter denoting one or more occurrences, so "ab+c" stands for "abc", "abbc", "abbbc", and so on.

However, when the DFA model D recognizes the regular language L(D) and converts it into a regular expression, an ambiguous regular expression is likely to result.

As an example, consider matching the string baaa against the regular expression (a∪b)*aa(a∪b)* over the alphabet Σ = {a, b}. The string can be matched at more than one position. To make this explicit, add subscripts to the regular expression: (a₁∪b₁)*a₂a₃(a₄∪b₂)*. Under this expression the string baaa can be matched as b₁a₁a₂a₃ and also as b₁a₂a₃a₄. That is, a single string can be matched in more than one way by the single regular expression (a∪b)*aa(a∪b)*. Such an expression is ambiguous, and it leads to the error of counting the occurrence probability of the string more than once.

By contrast, the regular expression b*a(bb*a)*a(a∪b)* is unambiguous: it matches the same string in exactly one way.
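The double counting can be made concrete with a short Python sketch. The helper names count_parses and same_language are hypothetical. A parse of s under (a∪b)*aa(a∪b)* is fixed by where the middle "aa" sits, so the number of parses equals the number of positions at which "aa" starts. The second helper only checks that both expressions describe the same language (all strings containing "aa"); it does not itself prove unambiguity.

```python
import re
from itertools import product


def count_parses(s):
    """Number of ways s matches (a|b)*aa(a|b)*: one per start position of 'aa'."""
    return sum(1 for i in range(len(s) - 1) if s[i:i + 2] == "aa")


# The unambiguous expression from the text, in Python regex syntax.
UNAMBIGUOUS = re.compile(r"b*a(bb*a)*a[ab]*")


def same_language(max_len=8):
    """Check on all strings up to max_len that both expressions accept the same set."""
    for n in range(max_len + 1):
        for chars in product("ab", repeat=n):
            s = "".join(chars)
            if (count_parses(s) > 0) != bool(UNAMBIGUOUS.fullmatch(s)):
                return False
    return True
```

Here count_parses("baaa") is 2, matching the two subscripted readings b₁a₁a₂a₃ and b₁a₂a₃a₄, so a naive sum over parses would count the probability of baaa twice.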

To prevent this error, the regular expression conversion unit 130 converts the regular language L(D), recognized by the DFA model D acquired by the DFA model acquisition unit 120, into the form of an unambiguous regular expression.

Let the DFA model be D = (Q, Σ, δ, q₁, F), with |Q| = n and the states ordered from 1 to n. The regular expression conversion unit 130 adds two new states, q₀ and q_{n+1}, to the state set Q of D. It adds the transition δ(q₀, λ) = q₁ so that q₀ becomes the new initial state, and it adds the transitions δ(q, λ) = q_{n+1} for all q ∈ F so that q_{n+1} becomes the new, unique final state.

FIG. 2 shows an example in which the regular expression conversion unit of FIG. 1 adds states and transitions to a DFA model.

In FIG. 2, (a) and (b) each show an example of a DFA model D acquired by the DFA model acquisition unit 120, and (c) and (d) show the models obtained when the regular expression conversion unit 130 adds the new states and transitions to the DFA models of (a) and (b), respectively.

That is, the new initial state q₀ moves to the previous initial state q₁ on the empty string λ, i.e., unconditionally, and every one of the previous final states in F moves unconditionally to the new, unique final state q_{n+1}.

Then, using Equation 1, the states q₁, q₂, ..., q_n ∈ Q are eliminated dynamically and iteratively from the DFA model D to which the states and transitions have been added.

In Equation 1, one term corresponds to the set of strings that lead from state q_i to state q_j, at the k-th elimination step (1 ≤ k ≤ n), along paths whose intermediate states q_l all satisfy l < k. Similarly, the other term corresponds to all strings that lead from state q_i to state q_j by way of the intermediate state q_k.

Equation 1 follows the usual concatenation, union, and Kleene star rules for regular expressions. Because Equations 1 and 2 remove the overlapping regular-expression paths that could otherwise make the result ambiguous, the unambiguous regular expression R for the regular language L(D) can be obtained as the expression for the paths from the new initial state q₀ to the new final state q_{n+1} after all n original states have been eliminated.

Here, the expressions from which the elimination starts (those for the direct transitions between two states) are determined by the conditions of Equation 2.
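Equations 1 and 2 are not reproduced in this text. The LaTeX below is a hedged reconstruction from the surrounding description, using the standard state-elimination recurrence. The symbol α and the exact form of the base case are assumptions, not the patent's own notation.

```latex
% Assumed notation: \alpha^{k}_{i,j} is the regular expression for all paths
% from q_i to q_j whose intermediate states have index at most k.

% Role of Equation 1: recurrence for the k-th elimination step, 1 <= k <= n
\alpha^{k}_{i,j} \;=\; \alpha^{k-1}_{i,j} \;\cup\;
  \alpha^{k-1}_{i,k}\,\bigl(\alpha^{k-1}_{k,k}\bigr)^{*}\,\alpha^{k-1}_{k,j}

% Role of Equation 2: base case, direct transitions only
\alpha^{0}_{i,j} \;=\; \bigcup \,\{\, c \in \Sigma \cup \{\lambda\} \;:\; \delta(q_i, c) = q_j \,\}
  \qquad (\emptyset \text{ if no such } c \text{ exists})

% Resulting regular expression for L(D)
R \;=\; \alpha^{n}_{0,\,n+1}
```

The first term of the recurrence covers paths that avoid q_k as an intermediate state, and the second covers paths that pass through q_k, which matches the two descriptions given for Equation 1.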

Also, in Equation 1, operations involving a character c and the empty set ∅ are carried out according to the four rules of Equation 3.
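The elimination procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's implementation. Regular expressions are held as Python regex strings, None stands for ∅, and the empty string stands for λ. The four ∅ rules are assumed to be c ∪ ∅ = c, c·∅ = ∅, ∅·c = ∅, and ∅* = λ. All function names are hypothetical, and characters are assumed not to be regex metacharacters.

```python
import re
from itertools import product

EMPTY = None   # the empty set
LAMBDA = ""    # the empty string


def union(r, s):
    if r is EMPTY:
        return s
    if s is EMPTY:
        return r
    return f"(?:{r}|{s})"


def concat(r, s):
    if r is EMPTY or s is EMPTY:
        return EMPTY
    return r + s


def star(r):
    if r is EMPTY or r == LAMBDA:
        return LAMBDA
    return f"(?:{r})*"


def dfa_to_regex(n, delta, finals):
    """States are 1..n with initial state 1; delta maps (state, char) -> state."""
    size = n + 2
    alpha = [[EMPTY] * size for _ in range(size)]
    for (i, c), j in sorted(delta.items()):
        alpha[i][j] = union(alpha[i][j], c)      # direct transitions
    alpha[0][1] = LAMBDA                          # new initial state q_0
    for f in finals:
        alpha[f][n + 1] = LAMBDA                  # new unique final state q_{n+1}
    for k in range(1, n + 1):                     # eliminate q_1 ... q_n in order
        loop = star(alpha[k][k])
        alpha = [[union(alpha[i][j],
                        concat(concat(alpha[i][k], loop), alpha[k][j]))
                  for j in range(size)] for i in range(size)]
    return alpha[0][n + 1]


def matches_language(regex, predicate, alphabet="ab", max_len=6):
    """Compare the regex with a reference predicate on all short strings."""
    pattern = re.compile(regex)
    return all(bool(pattern.fullmatch("".join(t))) == predicate("".join(t))
               for n in range(max_len + 1)
               for t in product(alphabet, repeat=n))
```

For the three-state DFA that accepts strings containing "aa", the expression returned by dfa_to_regex can be compared with the predicate "aa" in s using matches_language. The expressions grow quickly with n, so this sketch is only suitable for small automata.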

Once the DFA model D has been modified by the regular expression conversion unit 130 and the unambiguous regular expression R has been obtained through repeated state elimination, the matrix conversion unit 140 converts R into matrix form.

The matrix conversion unit 140 converts the unambiguous regular expression R into a matrix by mapping it according to Equation 4.

Here, given the matrices that correspond to two arbitrary regular expressions R and S, the operations between R and S can be expressed as matrix operations, as in Equation 5.
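Equations 4 and 5 are not reproduced in this text. A plausible form, written in LaTeX under the assumption that the mapping is denoted by a hypothetical symbol M(·) and that δ is the PFA transition weight, is:

```latex
% Assumed notation: \mathbb{M}(\cdot) maps a regular expression to a |Q| x |Q| matrix.

% Role of Equation 4: mapping of the basic expressions
\mathbb{M}(\emptyset) = 0, \qquad
\mathbb{M}(\lambda) = 1, \qquad
\mathbb{M}(c)_{i,j} = \delta(q_i, c, q_j)

% Role of Equation 5: operations on two expressions R and S
\mathbb{M}(R \cup S) = \mathbb{M}(R) + \mathbb{M}(S)
\mathbb{M}(R S)      = \mathbb{M}(R)\,\mathbb{M}(S)
\mathbb{M}(R^{*})    = \sum_{k \ge 0} \mathbb{M}(R)^{k} = \bigl(1 - \mathbb{M}(R)\bigr)^{-1}
```

These identities count each path exactly once only when the expression is unambiguous, which is why the conversion to an unambiguous regular expression precedes this step. The closed form for the star assumes that the series converges.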

On the basis of a PFA model P, modeled as a probabilistic finite automaton (hereinafter PFA), the PFA model intersection unit 150 obtains the weight of the unambiguous regular expression R from its regular-expression matrix. It then applies weights, expressed as probabilities, to each state and transition path of the DFA model D modified by the regular expression conversion unit 130, thereby obtaining the DFA model D intersected with the PFA model P.

The PFA model P is a model for obtaining a probability function that assigns a weight in [0, 1] to every string in Σ*. It computes the weight P(w) of a string w and the weight P(π) of a path π.

For the string to be searched for, w = w₁w₂...w_n ∈ Σ*, a corresponding path π in the PFA model P is π = (q₀, w₁, q₁), (q₁, w₂, q₂), ..., (q_{n-1}, w_n, q_n).

Like the DFA model D, the PFA model P is modeled as a 5-tuple, (Q, Σ, δ, I, F). In P, the transition function satisfies δ: Q × Σ × Q → [0, 1]. I gives, for each state, the probability that it is an initial state (I: Q → [0, 1]), and F gives, for each state, the probability that it is a final state (F: Q → [0, 1]). The default value of the transition function δ is assumed to be 0; that is, if no transition exists between two states, its weight is taken to be 0.

In the PFA model P, the initial probabilities sum to 1 (Σ_{q∈Q} I(q) = 1); for every state, the final probability and the probabilities of its outgoing transitions sum to 1 (for all q ∈ Q, F(q) + Σ_{q'∈Q, c∈Σ} δ(q, c, q') = 1); and every state is accessible and co-accessible.
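The two summation conditions can be checked mechanically. The sketch below uses exact fractions and a small two-state PFA invented for illustration. The name is_pfa and the dictionary layout are assumptions, and the accessibility condition is not checked.

```python
from fractions import Fraction as Fr


def is_pfa(states, alphabet, delta, init, final):
    """Check the weight range and the two sum-to-one conditions.

    delta maps (q, c, q2) -> weight; entries that are absent count as 0.
    """
    weights = list(delta.values()) + list(init.values()) + list(final.values())
    if any(not 0 <= x <= 1 for x in weights):
        return False
    if sum(init.get(q, 0) for q in states) != 1:
        return False
    return all(
        final.get(q, 0)
        + sum(delta.get((q, c, r), 0) for c in alphabet for r in states) == 1
        for q in states
    )


# Hypothetical example PFA over {a, b}
STATES = [0, 1]
ALPHABET = "ab"
DELTA = {(0, "a", 1): Fr(1, 2), (0, "b", 0): Fr(1, 4),
         (1, "a", 0): Fr(1, 3), (1, "b", 1): Fr(1, 3)}
INIT = {0: Fr(1)}
FINAL = {0: Fr(1, 4), 1: Fr(1, 3)}
```

In this example state 0 has 1/2 + 1/4 + 1/4 = 1 and state 1 has 1/3 + 1/3 + 1/3 = 1, so is_pfa(STATES, ALPHABET, DELTA, INIT, FINAL) holds. Relaxing both equalities to "at most 1" would give the check for a sub-PFA.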

If these conditions hold, the weight P(w) of a string w satisfies 0 ≤ P(w) ≤ 1, and Σ_{w∈Σ*} P(w) = 1.

Accordingly, the weight, i.e., the probability, of a path π in the PFA model P can be computed by Equation 6.
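Equation 6 is not reproduced in this text. For a path π = (q₀, w₁, q₁), ..., (q_{n-1}, w_n, q_n), the usual definition of a path weight, given here as an assumed reconstruction, multiplies the initial weight, the transition weights along the path, and the final weight:

```latex
% Assumed reconstruction of the path weight (role of Equation 6)
P(\pi) \;=\; I(q_0) \cdot \Bigl( \prod_{i=1}^{n} \delta(q_{i-1}, w_i, q_i) \Bigr) \cdot F(q_n)
```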

If Φ_w denotes the set of all paths corresponding to the string w, the weight, i.e., the probability, of w in the PFA model P is computed as P(w) = Σ_{π∈Φ_w} P(π).

Meanwhile, when each character c is expressed as a matrix as in Equation 4, the PFA model P can itself be written as a tuple in matrix form. In that tuple, the transition component is a set of |Q| × |Q| transition matrices, one per character c, whose (i, j) entry is the transition weight δ(q_i, c, q_j). The initial component is a 1 × |Q| row vector whose entries are the initial probabilities I(q), and the final component is a |Q| × 1 column vector whose entries are the final probabilities F(q).

Accordingly, when the string w is expressed in matrix form, its probability is given by Equation 7.
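A minimal Python sketch of this matrix form follows. It assumes that Equation 7 is the product of the initial row vector, the character matrices of w in order, and the final column vector. The example matrices describe a hypothetical two-state PFA, and the names vec_mat and string_probability are invented for illustration.

```python
from fractions import Fraction as Fr

# Hypothetical PFA over {a, b} in matrix form
INIT = [Fr(1), Fr(0)]                                   # 1 x |Q| row vector
MATS = {"a": [[Fr(0), Fr(1, 2)], [Fr(1, 3), Fr(0)]],    # entry (i, j) = delta(q_i, a, q_j)
        "b": [[Fr(1, 4), Fr(0)], [Fr(0), Fr(1, 3)]]}
FINAL = [Fr(1, 4), Fr(1, 3)]                            # |Q| x 1 column vector


def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def string_probability(init, mats, final, w):
    """P(w) as init * M(w_1) * ... * M(w_n) * final."""
    v = list(init)
    for c in w:
        v = vec_mat(v, mats[c])
    return sum(x * f for x, f in zip(v, final))
```

With these matrices, string_probability(INIT, MATS, FINAL, "aa") evaluates to 1/24: the only path goes from state 0 to state 1 to state 0 with weight 1/2 · 1/3 and then stops with probability 1/4.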

For brevity, the product of the matrices of the characters of w is written as a single matrix for w, and 0 and 1 denote the zero matrix and the identity matrix, respectively.

In the PFA model P, the matrix weight for the set of characters Σ can be computed as the sum of the matrices of all characters c ∈ Σ.

Accordingly, the probabilities that the string w appears in the PFA model P as a prefix (wΣ*) or as a suffix (Σ*w) can be computed by Equations 8 and 9, respectively.
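Equations 8 and 9 are not reproduced in this text. The sketch below assumes the usual matrix forms: the prefix probability is I · M(w) · (1 − M(Σ))⁻¹ · F and the suffix probability is I · (1 − M(Σ))⁻¹ · M(w) · F, where M(Σ) is the sum of the character matrices. Instead of inverting the matrix, it solves two linear systems exactly. All names and the example PFA are hypothetical.

```python
from fractions import Fraction as Fr

INIT = [Fr(1), Fr(0)]
MATS = {"a": [[Fr(0), Fr(1, 2)], [Fr(1, 3), Fr(0)]],
        "b": [[Fr(1, 4), Fr(0)], [Fr(0), Fr(1, 3)]]}
FINAL = [Fr(1, 4), Fr(1, 3)]


def solve(a, b):
    """Solve a * x = b by Gauss-Jordan elimination (a square and invertible)."""
    n = len(a)
    rows = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [x / rows[col][col] for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]


def vec_mat(v, m):
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def prefix_and_suffix(init, mats, final, w):
    n = len(init)
    # a = 1 - M(Sigma), with M(Sigma) the sum of all character matrices
    a = [[Fr(int(i == j)) - sum(m[i][j] for m in mats.values())
          for j in range(n)] for i in range(n)]
    tail = solve(a, final)                               # (1 - M(Sigma))^-1 * F
    head = solve([list(col) for col in zip(*a)], init)   # I * (1 - M(Sigma))^-1
    pre, suf = list(init), head
    for c in w:
        pre, suf = vec_mat(pre, mats[c]), vec_mat(suf, mats[c])
    prefix = sum(x * y for x, y in zip(pre, tail))
    suffix = sum(x * y for x, y in zip(suf, final))
    return prefix, suffix
```

For the example PFA, prefix_and_suffix(INIT, MATS, FINAL, "a") gives (1/2, 11/24). As a consistency check, the probabilities of ending in a (11/24), ending in b (7/24), and being empty (1/4) sum to 1.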

If an automaton M satisfies every requirement of a PFA except that the two equalities are relaxed to Σ_{q∈Q} I(q) ≤ 1 and, for all q ∈ Q, F(q) + Σ_{q'∈Q, c∈Σ} δ(q, c, q') ≤ 1, then M is called a sub-PFA. A sub-PFA M assigns to every string w over Σ a weight M(w) with 0 ≤ M(w) ≤ 1, and Σ_{w∈Σ*} M(w) ≤ 1.

A stochastic language over the set of characters Σ is a set S ⊆ Σ* in which each string has an associated probability Pr_S(w) with 0 ≤ Pr_S(w) ≤ 1 and Σ_{w∈Σ*} Pr_S(w) = 1. A stochastic language S is called a regular stochastic language if there is a PFA P such that Pr_S(w) = P(w) for all w ∈ Σ*.

The intersection of the DFA model D with the PFA model P can be performed by calculating and applying the weight of the regular language L(D), and is obtained by constructing the sub-PFA [D ∩ P] according to Equations 10 and 11.

Given a DFA model D and a PFA model P, the intersection algorithm for the sub-PFA [D ∩ P] according to Equations 10 and 11 constructs a new model W with state set Q_W = Q_D × Q_P. For two states (x, y), (x', y') and a character c ∈ Σ, δ_W((x, y), c, (x', y')) = δ_P(y, c, y') if δ_D(x, c) = x', and 0 otherwise. Similarly, if p is the initial state of the DFA model D, then I_W((p, q)) = I_P(q), and if p' is a final state of the DFA model D, then F_W((p', q')) = F_P(q'); all other entries are 0.
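The product construction described above can be sketched as follows. This is a minimal illustration under assumed data structures (dictionaries for the transition functions, with absent entries standing for weight 0); the function names, the example PFA and the example DFA are hypothetical.

```python
from fractions import Fraction as Fr

def intersect(d_delta, d_init, d_finals, I_P, F_P, delta_P):
    """Build (I_W, F_W, delta_W) on pairs (x, y) of a DFA state x and a PFA state y."""
    d_states = {d_init, *d_finals, *d_delta.values(), *(x for x, _ in d_delta)}
    I_W = {(d_init, q): p for q, p in I_P.items()}
    F_W = {(x, q): p for x in d_finals for q, p in F_P.items()}
    delta_W = {}
    for (y, c, y2), p in delta_P.items():
        for x in d_states:
            if (x, c) in d_delta:            # delta_D(x, c) = x' is defined
                delta_W[((x, y), c, (d_delta[(x, c)], y2))] = p
    return I_W, F_W, delta_W

def weight(I, F, delta, w):
    """Weight of string w in a weighted automaton (forward algorithm)."""
    cur = dict(I)
    for c in w:
        nxt = {}
        for (q, ch, q2), p in delta.items():
            if ch == c and q in cur:
                nxt[q2] = nxt.get(q2, 0) + cur[q] * p
        cur = nxt
    return sum(v * F.get(q, 0) for q, v in cur.items())

# Hypothetical PFA over {a, b}
I_P = {0: Fr(1), 1: Fr(0)}
F_P = {0: Fr(1, 4), 1: Fr(1, 3)}
delta_P = {(0, "a", 1): Fr(1, 2), (0, "b", 0): Fr(1, 4),
           (1, "a", 0): Fr(1, 3), (1, "b", 1): Fr(1, 3)}

# Hypothetical partial DFA accepting strings whose first "aa" is at the very end
d_delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 2, (1, "b"): 0}
W = intersect(d_delta, 0, {2}, I_P, F_P, delta_P)
```

In this sketch a string accepted by the DFA keeps its PFA weight in W, and a rejected string gets weight 0, which is why W is in general only a sub-PFA.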

The intersection algorithm for the DFA model D and the PFA model P can be summarized as shown in FIG. 3.

FIG. 3 shows the algorithm for intersecting the DFA model with the PFA model.

According to the algorithm shown in FIG. 3, the sum of the weights of all strings in the regular language L(D) can be calculated by Equation 12.

From Equations 11 and 12, the PFA model intersection unit 150 can obtain the weight for the regular expression matrix ([expression not reproduced]) of the unambiguous regular language L(D) by Equation 13.

The DFA model D intersected with the PFA model P in the PFA model intersection unit 150 is a model that calculates the probability that the string w appears. It therefore cannot calculate the probability that an extension of the string w (for example, wa) appears.

Accordingly, the incremental probability calculation unit 160 obtains a DFA model D intersected with the PFA model P that corresponds to the increment, so that the probability of the extended string can be calculated, and uses it to calculate the occurrence probability of the string w and of its extended string.

In this embodiment, a string language F(w) is defined as the set of strings in which the string w occurs exactly once; that is, F(w) is the set of strings in which w appears only once and only as a suffix. Consequently, the string set F(w)·Σ* is the set of all strings that contain w, and given an unambiguous regular expression for F(w), concatenating it with Σ* yields an unambiguous regular expression for all strings containing w.

The occurrence probability of the string language F(w) can be calculated using the DFA model D intersected with the PFA model P obtained in the PFA model intersection unit 150.

Meanwhile, for a character a ∈ Σ that extends the string w, a regular language L satisfying F(wa) = F(w)·L is sought. Such a language L can be obtained as F(w)＼F(wa), where ＼ is the language operator that, given two languages R and S, satisfies [expression not reproduced].

Therefore, given F(w) and F(w)＼F(wa), F(wa) can be obtained. That is, in this embodiment F(wa) is not obtained directly; instead, the string language F(w)＼F(wa), which relates the strings of F(w) to those of F(wa), is used to obtain F(wa) = F(w)·(F(w)＼F(wa)).
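The decomposition can be checked by brute force on short strings. The definition of ＼ is not reproduced in this text, so the sketch below assumes it is the usual left quotient, R＼S = { y : xy ∈ S for some x ∈ R }; the alphabet {a, b}, the length bound and all names are likewise assumptions made for illustration.

```python
from itertools import product

SIGMA = "ab"   # assumed alphabet

def in_F(s, w):
    """True if the first (hence only) occurrence of w in s is as a suffix of s."""
    return s.endswith(w) and s.find(w) == len(s) - len(w)

def strings(max_len):
    """All strings over SIGMA of length 0..max_len."""
    for n in range(max_len + 1):
        for t in product(SIGMA, repeat=n):
            yield "".join(t)

def check_decomposition(w, a, max_len=8):
    """Compare F(wa) with F(w) . (F(w) \\ F(wa)) on strings up to max_len."""
    F_w = [s for s in strings(max_len) if in_F(s, w)]
    F_wa = {s for s in strings(max_len) if in_F(s, w + a)}
    # Bounded left quotient: remainders y with x + y in F(wa) for some x in F(w)
    quotient = {z[len(x):] for z in F_wa for x in F_w if z.startswith(x)}
    concat = {x + y for x in F_w for y in quotient if len(x + y) <= max_len}
    return concat == F_wa, quotient

ok_aa, quotient_aa = check_decomposition("a", "a")
ok_aab, quotient_aab = check_decomposition("aa", "b")
```

For w = a extended by a, the bounded quotient contains a and baa but not ba, in line with the expression (bb*a)*a that the description gives for F(a)＼F(aa).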

FIG. 4 shows the relationship between the regular expressions for a string and for its extensions.

FIG. 4 shows the case where the string w is the single character a and is extended by the increments a and then b, that is, to aa and aab.

Referring to FIG. 4, when the alphabet consists of the characters a and b (Σ = {a, b}), the string language F(w) is expressed by the regular expression F(a) = b*a. The string language F(wa), in which the string w is extended by the increment character a, can be expressed by the regular expression F(aa) = b*a·(bb*a)*a, where b*a is F(a) and (bb*a)*a is F(a)＼F(aa). Similarly, the string language F(wab), further extended by the increment character b, can be expressed from F(aa) and F(aa)＼F(aab) by the regular expression F(aab) = b*a·(bb*a)*a·a*b.
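The three expressions of FIG. 4 can be compared against the definition of F(w) by enumerating short strings. The sketch below uses Python's standard `re` module; the helper names and the length bound are assumptions made for illustration.

```python
import re
from itertools import product

# Regular expressions quoted in the description, with the concatenation dots removed
REGEX = {
    "a": r"b*a",
    "aa": r"b*a(bb*a)*a",
    "aab": r"b*a(bb*a)*aa*b",
}

def in_F(s, w):
    """True if the first (hence only) occurrence of w in s is as a suffix of s."""
    return s.endswith(w) and s.find(w) == len(s) - len(w)

def mismatches(w, pattern, max_len=10):
    """Strings over {a, b} up to max_len on which the regex and the definition disagree."""
    bad = []
    for n in range(max_len + 1):
        for t in product("ab", repeat=n):
            s = "".join(t)
            if bool(re.fullmatch(pattern, s)) != in_F(s, w):
                bad.append(s)
    return bad

all_agree = all(mismatches(w, p) == [] for w, p in REGEX.items())
```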

This means that, building on the previously obtained string language F(w), once a DFA model D intersected with the PFA model P (that is, a weighted DFA model D) is obtained for the unambiguous regular expression of the quotient language, such as F(aa)＼F(aab), the weighted DFA model D for the extended string languages F(wa) and F(wab) can be obtained easily.
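The reuse described here can be illustrated with matrix weights. The sketch below maps each character of the FIG. 4 expressions to its PFA transition matrix, union to matrix addition, concatenation to matrix product, and star to (1 − X)^-1, which is sound only because the expressions are unambiguous. It further assumes a proper PFA, so that the infix probability of w equals I^T W(F(w)) 1. It hard-codes the FIG. 4 example and is not the algorithm of FIG. 5; all names are hypothetical.

```python
from fractions import Fraction as Fr

def mul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]

def star(A):
    """Return sum_k A^k = (1 - A)^-1 by Gauss-Jordan elimination (assumes invertibility)."""
    n = len(A)
    aug = [[Fr(int(i == j)) - A[i][j] for j in range(n)] +
           [Fr(int(i == j)) for j in range(n)] for i in range(n)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def infix_probs(I, M):
    """Infix probabilities of a, aa and aab over {a, b}, each reusing the previous result."""
    a, b = M["a"], M["b"]
    ones = [[Fr(1)] for _ in range(len(a))]
    W_a = mul(star(b), a)                          # F(a) = b* a
    q_aa = mul(star(mul(mul(b, star(b)), a)), a)   # F(a)\F(aa) = (b b* a)* a
    W_aa = mul(W_a, q_aa)                          # reuses W_a
    q_aab = mul(star(a), b)                        # F(aa)\F(aab) = a* b
    W_aab = mul(W_aa, q_aab)                       # reuses W_aa
    return [mul(mul(I, W), ones)[0][0] for W in (W_a, W_aa, W_aab)]

# Hypothetical two-state proper PFA over {a, b}
I = [[Fr(1), Fr(0)]]
M = {"a": [[Fr(0), Fr(1, 2)], [Fr(1, 3), Fr(0)]],
     "b": [[Fr(1, 4), Fr(0)], [Fr(0), Fr(1, 3)]]}
probs = infix_probs(I, M)
```

Each extension multiplies the previously computed matrix by the weight of one quotient expression, so the work done for the shorter string is not repeated.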

From the DFA model D, obtained so as to have n+1 states including the final state q_{n+1} for the string language F(w = w₁, w₂, ..., w_n), the unambiguous regular expression can be extracted as [expression not reproduced], and similarly the unambiguous regular expression for the string language F(w = w₁, w₂, ..., w_k) can be extracted as [expression not reproduced].

At step k of the state elimination procedure on the DFA model D, the initial state q₀ is connected only to states up to k−1, so [expression not reproduced] holds. Therefore, using Equation 14, the states (q = q₁, q₂, ..., q_n ∈ Q) of the DFA model D can be eliminated dynamically and iteratively.

Meanwhile, Equation 14 can be written more simply as Equation 15.

In Equation 15, [expression not reproduced] holds, so Equation 15 can be expressed as Equation 16.

From Equation 16, F(w = w₁, w₂, ..., w_{k-1})＼F(w = w₁, w₂, ..., w_k) can then be calculated by Equation 17.

According to Equation 17, the term [expression not reproduced] representing the unambiguous regular expression for F(aa)＼F(aab) can be obtained. By adding the corresponding states and transition functions to the weighted DFA model D, the weighted DFA model D can be modified to correspond to the extended string set F(wa) = F(w)·(F(w)＼F(wa)). The weighted DFA model D modified for the extended string wa can then calculate and output the occurrence probability of the extended string wa.

FIG. 5 shows the algorithm by which the incremental probability calculation unit of FIG. 1 cumulatively calculates, from the DFA model, the occurrence probability of a string and of its extended strings.

In the algorithm of FIG. 5, (n+3) × (n+3) tables (T, T') are created for the DFA model D because, as described above, the DFA model D was obtained so as to have n+1 states including the final state q_{n+1} for the string language F(w = w₁, w₂, ..., w_n).

The vector ([expression not reproduced]) is the result recorded from the previous string, and its initial value is set to [expression not reproduced]. The probability for the string (w = w₁, w₂, ..., w_k) is obtained as [expression not reproduced], and the string (w = w₁, w₂, ..., w_k) can be extended incrementally as shown in FIG. 5.

FIG. 6 shows the measured computation time of the incremental method and of the intersection method as a function of string length and state-set size.

In the conventional method, the occurrence probability must be calculated separately for every extended string, so the computation time grows exponentially with the number of extended strings. In this embodiment, by contrast, the time needed to calculate the occurrence probability of a string is reduced relative to the conventional method, and the saving grows as the number of incremented characters increases.

In the example shown in FIG. 6, when the automata-based incremental infix probability calculation apparatus according to this embodiment is used with a state set Q of size |Q| = 1455 and the string is extended up to length 9, a computation speed-up of up to 560.76% was confirmed.

FIG. 7 shows an automata-based incremental infix probability calculation method according to an embodiment of the present invention.

The method of FIG. 7 is described with reference to FIG. 1. First, the string w whose occurrence probability is to be calculated is obtained (S10). Then, for each character of the obtained string w, a DFA model D is built on the basis of a DFA (S20).

The DFA model D is built to obtain the regular language L(D) corresponding to the obtained string w, and the regular language L(D) can be obtained in the form of a regular expression.

Once the DFA model D is obtained, a new initial state q₀ and a new single final state q_{n+1} are added to the state set Q of the DFA model D together with the corresponding transition functions δ. The states (q = q₁, q₂, ..., q_n ∈ Q) of the DFA model D with the added states and transitions are then eliminated dynamically and iteratively according to Equation 1, so that the DFA model D is converted into a form from which an unambiguous regular expression can be obtained (S30).
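Step S30 can be illustrated with textbook state elimination. Equation 1 is not reproduced in this part of the text, so the sketch below is a generic version of the technique rather than the patent's exact recurrence: it adds a fresh initial and a fresh final state, then removes the original states one at a time while rerouting paths through the removed state. Applied to a DFA, state elimination is known to yield an unambiguous regular expression. The state names and the dictionary layout are assumptions.

```python
import re

def dfa_to_regex(states, delta, start, finals):
    """Return a Python regex for the DFA language, or None if it is empty.

    delta maps (state, char) -> state and may be partial.
    """
    INIT, FIN = "q_init", "q_fin"   # hypothetical names for the added states
    R = {}                           # (p, q) -> regex labelling the edge p -> q

    def grp(r):
        return f"(?:{r})" if r else ""

    def add(p, q, r):
        # Parallel edges are merged by union
        R[(p, q)] = f"(?:{R[(p, q)]}|{r})" if (p, q) in R else r

    add(INIT, start, "")             # empty-string edge into the old start state
    for f in finals:
        add(f, FIN, "")              # empty-string edges out of the old final states
    for (p, c), q in delta.items():
        add(p, q, re.escape(c))

    for k in states:                 # eliminate the original states one by one
        loop = R.pop((k, k), None)
        star = f"(?:{loop})*" if loop else ""
        ins = [(p, r) for (p, q), r in R.items() if q == k]
        outs = [(q, r) for (p, q), r in R.items() if p == k]
        for p, _ in ins:
            del R[(p, k)]
        for q, _ in outs:
            del R[(k, q)]
        for p, r_in in ins:          # reroute every path p -> k -> q
            for q, r_out in outs:
                add(p, q, grp(r_in) + star + grp(r_out))
    return R.get((INIT, FIN))

# Partial DFA for F(aa) over {a, b}: accept as soon as the first "aa" is read
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 2, (1, "b"): 0}
regex_aa = dfa_to_regex([0, 1, 2], delta, 0, {2})
```

For this example the returned pattern is a heavily parenthesized equivalent of b*a(bb*a)*a.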

Once the DFA model D has been modified by adding states and transitions, each character of the regular expression is mapped to a predetermined matrix, converting the regular expression into a regular expression matrix (S40).

Then, based on the PFA model P built from a PFA, the weights (P(w), P(π)) of each state (that is, each character) and each transition path of the DFA model D converted into the regular expression matrix are calculated and applied to the converted DFA model D, yielding the DFA model D intersected with the PFA model P (S50).

The DFA model D intersected with the PFA model P can calculate the occurrence probability of the string w but not that of an extended string. Therefore, additional states and transition functions are cumulatively applied to the intersected DFA model D so that it can calculate the occurrence probability of the string set F(w), in which w appears exactly once as a suffix, and of F(w)＼F(wa), obtained from F(w) and the string set F(wa) in which the extended string wa appears exactly once as a suffix. Using the DFA model D with the accumulated states and transition functions, the occurrence probability of the extended string set F(wa) is calculated according to F(wa) = F(w)·(F(w)＼F(wa)) (S60).

The method according to the present invention may be implemented as a computer program stored in a medium for execution on a computer. Here, the computer-readable medium may be any available medium that can be accessed by a computer, and may include all computer storage media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules or other data, and may include ROM (read-only memory), RAM (random access memory), CD (compact disc)-ROM, DVD (digital video disc)-ROM, magnetic tape, floppy disks, optical data storage devices, and the like.

The present invention has been described with reference to the embodiments shown in the drawings, but these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible.

Therefore, the true technical scope of protection of the present invention should be determined by the technical spirit of the appended claims.

110: string acquisition unit 120: DFA model acquisition unit
130: regular expression conversion unit 140: matrix conversion unit
150: PFA model intersection unit 160: incremental probability calculation unit

Claims

An automata-based incremental infix probability calculation apparatus comprising:
a DFA model acquisition unit that, for each character of a string w whose occurrence probability is to be calculated, acquires a DFA model composed of a plurality of states and transition functions in order to obtain a regular language in the form of a regular expression based on a deterministic finite automaton (hereinafter, DFA);
a regular expression conversion unit that adds to the DFA model an initial state, a single final state, and transition functions corresponding to the initial state and the single final state, and then dynamically and iteratively removes paths that may overlap among the paths between the states of the DFA model, thereby converting the DFA model into a DFA model capable of expressing an unambiguous regular expression;
a PFA model intersection unit that calculates the probability of each state and transition path of the converted DFA model based on a PFA model obtained with a probabilistic finite automaton (hereinafter, PFA) and applies it as a weight of the converted DFA model;
an incremental probability calculation unit that obtains an incrementally weighted DFA model by adding, to the weighted DFA model, states and transition functions for the strings (F(w)＼F(wa)) that, among the string set F(w) containing the string w, are included in the string set F(wa) in which the extended string wa appears, and that uses the obtained incrementally weighted DFA model to calculate the occurrence probability of the string and of the extended string; and
a matrix conversion unit that converts each character of the regular expression into matrix form in the DFA model converted by the regular expression conversion unit so as to express an unambiguous regular expression.

The apparatus of claim 1, wherein the regular expression conversion unit
adds, to the DFA model having a state set (Q ∋ q) containing n states (q = q₁, q₂, ..., q_n), an initial state (q₀), a single final state (q_{n+1}), and the corresponding transition functions, and iteratively removes paths according to the equation

[equation not reproduced]

(where [expression not reproduced] denotes the set of strings that, at the k-th (1 ≤ k ≤ n) iterative elimination, start from state q_i and transition to state q_j through intermediate states q_l (where l < k), and [expression not reproduced] denotes all strings that start from state q_i and transition to state q_j through the intermediate state q_k).

The apparatus of claim 2, wherein the PFA model intersection unit,
based on the PFA model (P) built from the PFA, obtains and applies the weight of the DFA model (D) converted to express the unambiguous regular expression corresponding to the string w according to the equation

[equation not reproduced]

(where L(D) is the unambiguous regular expression for the string w expressed by the converted DFA model D).

The apparatus of claim 3, wherein the incremental probability calculation unit
obtains the incrementally weighted DFA model by acquiring, in the weighted DFA model, the states and transition functions for the overlapping strings (F(w)＼F(wa)) that, among the string set F(w) containing the string w, are included in the string set F(wa) in which the extended string wa appears, according to the equation

[equation not reproduced]

and adding them to the weighted DFA model.

(Deleted)

An automata-based incremental infix probability calculation method comprising:
obtaining, for each character of a string w whose occurrence probability is to be calculated, a DFA model composed of a plurality of states and transition functions in order to obtain a regular language in the form of a regular expression based on a deterministic finite automaton (hereinafter, DFA);
adding to the DFA model an initial state, a single final state, and transition functions corresponding to the initial state and the single final state, and then dynamically and iteratively removing paths that may overlap among the paths between the states of the DFA model, thereby converting the DFA model into a DFA model capable of expressing an unambiguous regular expression;
calculating the probability of each state and transition path of the converted DFA model based on a PFA model obtained with a probabilistic finite automaton (hereinafter, PFA), and applying it as a weight of the converted DFA model;
obtaining an incrementally weighted DFA model by adding, to the weighted DFA model, states and transition functions for the strings (F(w)＼F(wa)) that, among the string set F(w) containing the string w, are included in the string set F(wa) in which the extended string wa appears, and calculating the occurrence probability of the string and of the extended string using the obtained incrementally weighted DFA model; and
converting each character of the regular expression into matrix form in the DFA model converted to express an unambiguous regular expression.

The method of claim 6, wherein converting the DFA model comprises:
adding, to the DFA model having a state set (Q ∋ q) containing n states (q = q₁, q₂, ..., q_n), an initial state (q₀), a single final state (q_{n+1}), and the corresponding transition functions; and
iteratively removing paths according to the equation

[equation not reproduced]

(where [expression not reproduced] denotes all strings that start from state q_i and transition to state q_j through the intermediate state q_k).

The method of claim 7, wherein applying the weight of the DFA model comprises,
based on the PFA model (P) built from the PFA, obtaining and applying the weight of the DFA model (D) converted to express the unambiguous regular expression corresponding to the string w according to the equation

[equation not reproduced]

(where L(D) is the unambiguous regular expression for the string w expressed by the converted DFA model D).

The method of claim 8, wherein calculating the occurrence probability comprises:
acquiring, in the weighted DFA model, the states and transition functions for the overlapping strings (F(w)＼F(wa)) that, among the string set F(w) containing the string w, are included in the string set F(wa) in which the extended string wa appears, according to the equation

[equation not reproduced]

and adding them to the weighted DFA model;
obtaining the incrementally weighted DFA model from the weighted DFA model to which the states and transition functions have been added; and
calculating the occurrence probability of the extended string using the incrementally weighted DFA model.

(Deleted)