KR101417928B1

KR101417928B1 - Method for Structuring Natural Language And Mathematical Formula, Apparatus And Computer-Readable Recording Medium with Program Therefor

Info

Publication number: KR101417928B1
Application number: KR1020100133761A
Authority: KR
Inventors: 최승락; 박용길; 박근태; 이동학; 최형인; 위남숙; 이두석; 손정교; 김행문; 이종헌; 이명성
Original assignee: 주식회사 아이싸이랩; 에스케이 텔레콤주식회사
Priority date: 2010-12-23
Filing date: 2010-12-23
Publication date: 2014-07-14
Also published as: KR20120072001A

Abstract

An embodiment of the present invention relates to a method for structuring natural language and mathematical formulas, an apparatus therefor, and a computer-readable recording medium.
An embodiment of the present invention provides an apparatus for structuring natural language and mathematical formulas, comprising: an information input unit that receives combination data composed of a combination of natural language and a mathematical formula; a separation unit that separates the natural language and the formula from the combination data; a natural language processing unit that analyzes each piece of first information constituting the separated natural language and classifies it according to a specific meaning; a formula processing unit that analyzes each piece of second information constituting the separated formula and classifies it according to a specific meaning; and a data management unit that recombines at least one of the first information, the second information, the natural language, and the formula and stores the result as recombined data.
According to an embodiment of the present invention, a standard document in which natural language and formulas are combined can be used as a source of keywords in later searches of mathematical content.

Description

Technical field: the present invention relates to a method for structuring natural language and mathematical formulas, an apparatus therefor, and a computer-readable recording medium.

An embodiment of the present invention relates to a method for structuring natural language and mathematical formulas, an apparatus therefor, and a computer-readable recording medium. More specifically, because data in which natural language and formulas are combined is difficult to analyze with general natural language processing techniques, the embodiment relates to a method, an apparatus, and a computer-readable recording medium that make it possible to manage such data as recombined natural language and formula data, based on an analysis in which natural language processing (NLP) and formula processing are performed together.

The content described in this section merely provides background information on the embodiments of the present invention and does not constitute prior art.

Human language is rich and complex and includes an enormous vocabulary with intricate grammar and contextual meaning, whereas machines and software applications generally require data to be entered in a specific format or according to specific rules. Natural language input can be used in almost any software application that interacts with people. In a typical natural language processing flow, the natural language is split into tokens and mapped to one or more actions provided by the software application, and each software application is configured to have its own set of actions. In other words, the software developer writes code that interprets the natural language input and maps it to the appropriate action for each application.

However, this approach to natural language processing cannot recognize mathematical formulas.

To solve the above problem, the main object of an embodiment of the present invention is to provide a method for structuring natural language and formulas, an apparatus therefor, and a computer-readable recording medium that make it possible to manage data in which natural language and formulas are combined as recombined natural language and formula data, based on an analysis in which natural language processing and formula processing are performed together.

To achieve the above object, an embodiment of the present invention provides an apparatus for structuring natural language and formulas, comprising: an information input unit that receives combination data composed of a combination of natural language and a mathematical formula; a separation unit that separates the natural language and the formula from the combination data; a natural language processing unit that analyzes each piece of first information constituting the separated natural language and classifies it according to a specific meaning; a formula processing unit that analyzes each piece of second information constituting the separated formula and classifies it according to a specific meaning; and a data management unit that recombines at least one of the first information, the second information, the natural language, and the formula and stores the result as recombined data.

According to another aspect of the present invention, there is provided an apparatus for structuring natural language and formulas, comprising: an information input unit that receives combination data composed of a combination of natural language and a mathematical formula; a separation unit that separates the natural language and the formula from the combination data; a natural language processing unit that generates natural language tokens by tokenizing the separated natural language, generates stop-word-filtered data by filtering stop words on the basis of the natural language tokens, generates deduplicated data by performing duplicate-removal filtering on the stop-word-filtered data, and then matches the deduplicated data with action information to which a predefined meaning is assigned; a formula processing unit that analyzes each piece of second information constituting the separated formula and classifies it according to a specific meaning; and a data management unit that recombines at least one of the natural language tokens, the action information, the second information, the natural language, and the formula and stores the result as recombined data.

According to another aspect of the present invention, there is provided an apparatus for structuring natural language and formulas, comprising: an information input unit that receives combination data composed of a combination of natural language and a mathematical formula; a separation unit that separates the natural language and the formula from the combination data; a natural language processing unit that analyzes each piece of first information constituting the separated natural language and classifies it according to a specific meaning; a formula processing unit that converts the separated formula into a tree form, performs a traversal on the formula converted into tree form, and generates formula tokens by tokenizing the traversed formula; and a data management unit that recombines at least one of the first information, the formula tokens, the natural language, and the formula and stores the result as recombined data.

According to another aspect of the present invention, there is provided a method for structuring natural language and formulas, comprising: an information input step of receiving combination data composed of a combination of natural language and a mathematical formula; a separation step of separating the natural language and the formula from the combination data; a natural language processing step of analyzing each piece of first information constituting the separated natural language and classifying it according to a specific meaning; a formula processing step of analyzing each piece of second information constituting the separated formula and classifying it according to a specific meaning; and a data management step of recombining at least one of the first information, the second information, the natural language, and the formula and storing the result as recombined data.

According to another aspect of the present invention, there is provided a computer-readable recording medium on which a program for executing each step of the method for structuring natural language and formulas is recorded.

As described above, according to an embodiment of the present invention, data in which natural language and formulas are combined can be managed as recombined natural language and formula data, based on an analysis in which natural language processing and formula processing are performed together.

In addition, according to an embodiment of the present invention, the information obtained by analyzing the natural language and the formulas can be managed as a single piece of recombined data and converted into a standard mathematics document. Furthermore, a standard document in which natural language and formulas are combined can be used as a source of keywords in later searches of mathematical content.

In addition, according to an embodiment of the present invention, semantic information obtained from the natural language is defined as natural language tokens, so that the compositional pattern of the natural language and the action it describes can be identified. Semantic information obtained from the formula is defined as mathematical language (Mathematics Natural Language) tokens, so that it can be managed as data that includes both natural language and formulas.

FIG. 1 is a block diagram schematically showing an apparatus for structuring natural language and formulas according to an embodiment of the present invention.
FIG. 2 is a block diagram schematically showing a natural language processing unit according to an embodiment of the present invention.
FIG. 3 is a block diagram schematically showing a formula processing unit according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a method for structuring natural language and formulas according to an embodiment of the present invention.
FIG. 5 is an exemplary diagram showing a tree-form representation of a formula according to an embodiment of the present invention.
FIG. 6 is an exemplary diagram of a system in which the apparatus for structuring natural language and formulas according to an embodiment of the present invention provides data via cloud computing.
FIG. 7 is an exemplary diagram of a method of analyzing the information constituting natural language and formulas and classifying it according to a specific meaning, according to an embodiment of the present invention.

The natural language and formula structuring apparatus 100 described in the present invention is an apparatus for structuring (storing in a database) combination data composed of a combination of natural language and mathematical formulas, separately for the natural language and for the formulas. The apparatus 100 may be implemented in hardware or software and installed in a server or a terminal.

FIG. 1 is a block diagram schematically showing an apparatus for structuring natural language and formulas according to an embodiment of the present invention.

The natural language and formula structuring apparatus 100 according to an embodiment of the present invention includes an information input unit 110, a separation unit 120, a natural language processing unit 130, a formula processing unit 140, and a data management unit 150. Although this embodiment describes the apparatus 100 as including only the information input unit 110, the separation unit 120, the natural language processing unit 130, the formula processing unit 140, and the data management unit 150, this merely illustrates the technical idea of the embodiment. A person of ordinary skill in the art to which the embodiment pertains could modify and change the components included in the apparatus 100 in various ways without departing from the essential characteristics of the embodiment.

The information input unit 110 receives combination data composed of a combination of natural language and formulas. The combination data is preferably mathematical content such as math problems and proofs of formulas, but is not necessarily limited thereto. The combination data may be entered directly through a user's operation or command, but is not necessarily limited thereto; document data composed of a combination of natural language and formulas may also be received from a separate external server. The separation unit 120 separates the natural language and the formulas from the combination data. That is, when combination data is entered through the information input unit 110, the separation unit 120 separates the natural language and the formulas contained in the combination data and recognizes each of them individually.
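The separation step can be sketched as follows. This is only a minimal illustration under stated assumptions: the text does not say how formulas are delimited in the combination data, so the sketch assumes they arrive as inline `<math>...</math>` fragments, and the function name `separate` and the sample input are hypothetical (the actual formula of [Example 1] is not reproduced in the text).

```python
import re

# Assumption: each formula is embedded as a <math>...</math> fragment in plain text.
MATH_PATTERN = re.compile(r"(<math\b.*?</math>)", re.DOTALL)


def separate(combined: str):
    """Split combination data into natural language nodes and formula nodes."""
    natural_nodes, formula_nodes = [], []
    # re.split with a capturing group keeps the matched formulas in the result.
    for part in MATH_PATTERN.split(combined):
        part = part.strip()
        if not part:
            continue
        if part.startswith("<math"):
            formula_nodes.append(part)
        else:
            natural_nodes.append(part)
    return natural_nodes, formula_nodes


# Hypothetical input shaped like the description of [Example 1].
sample = (
    "Find the function value <math><mi>f</mi></math> "
    "with <math><mi>x</mi><mo>=</mo><mn>2</mn></math>"
)
natural_nodes, formula_nodes = separate(sample)
print(natural_nodes)
print(len(formula_nodes))
```

Keeping the two lists separate lets each one be handed to its own processing unit, at the cost of losing the original interleaving order; an implementation that needs the order could store an index with each node.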

The natural language processing unit 130 analyzes each piece of first information constituting the separated natural language and classifies it according to a specific meaning. More concretely, the natural language processing unit 130 analyzes the first information constituting the natural language and then identifies the specific meaning using at least one of the structure of the sentence and the keywords it contains. That is, the natural language processing unit 130 operates on the basis of preset rules to identify the specific meaning. The concrete way in which it analyzes the first information and classifies it according to a specific meaning is described with reference to FIG. 7.

The natural language processing unit 130 generates natural language tokens by tokenizing the natural language. Here, a token is a distinguishable unit within a continuous sentence, and tokenization is the process of splitting the natural language into word units that the apparatus 100 can understand. More specifically, in an embodiment of the present invention, tokenization is broadly divided into natural language tokenization and formula tokenization. Natural language tokenization is the process of recognizing, as a natural language token, each word obtained by splitting the natural language contained in the combination data (a math problem) on spaces. To determine the meaning of each token more clearly, morphological analysis may additionally be performed on the tokens. Formula tokenization is the process of recognizing, as formula tokens, the individual units of information obtained after parsing a formula contained in the combination data (a math problem).

[Example 1]

For example, in [Example 1] the natural language tokens are 'Find', 'the', 'function', 'value', and 'with', and the formula tokens are the values returned after information is extracted by parsing, such as polynomial (Polynomial), highest degree (Max degree = 3), number of terms (Number of terms = 4), and condition (Condition).
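Natural language tokenization as described here amounts to splitting each natural language node on whitespace. The sketch below assumes the two natural language nodes that the description attributes to [Example 1]; the function name `tokenize_natural_language` is hypothetical, and the optional morphological analysis is not shown.

```python
def tokenize_natural_language(nodes):
    """Split each natural language node on whitespace into word tokens."""
    tokens = []
    for node in nodes:
        # str.split() with no argument splits on any run of whitespace.
        tokens.extend(node.split())
    return tokens


tokens = tokenize_natural_language(["Find the function value", "with"])
print(tokens)  # ['Find', 'the', 'function', 'value', 'with']
```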

The natural language processing unit 130 generates stop-word-filtered data by filtering stop words on the basis of the natural language tokens, and generates deduplicated data by performing duplicate-removal filtering on the stop-word-filtered data. Here, stop words are a predefined set of words used to remove tokens that are unnecessary for analyzing a sentence or formula. In [Example 1], 'the' (as well as words such as 'a' and 'to') is predefined in the system in the form of a dictionary, where a dictionary means a list containing a set of words. That is, after generating the natural language tokens, the natural language processing unit 130 removes the stop words, which are not needed for the analysis. Stop word filtering prevents too many tokens from entering the analysis when a math problem is long (for example, a descriptive word problem), and it also improves the processing speed of the system.
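A minimal sketch of the two filters follows. The stop word dictionary contains only the three words named in the text ('the', 'a', 'to'); comparing tokens case-insensitively and keeping the first occurrence of each duplicate are assumptions of this sketch, as are the function names.

```python
# Stop word dictionary: only the words named in the description.
STOP_WORDS = {"a", "the", "to"}


def filter_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the predefined stop word dictionary."""
    return [t for t in tokens if t.lower() not in stop_words]


def remove_duplicates(tokens):
    """Keep the first occurrence of each token, preserving order."""
    seen = set()
    unique = []
    for t in tokens:
        key = t.lower()  # assumption: 'Find' and 'find' count as duplicates
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique


filtered = filter_stop_words(["Find", "the", "function", "value", "with"])
deduplicated = remove_duplicates(filtered)
print(deduplicated)  # ['Find', 'function', 'value', 'with']
```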

The natural language processing unit 130 matches the deduplicated data with action information to which a predefined meaning is assigned. Here, action information means summary information that can be extracted on the basis of the natural language tokens or formula tokens. For example, in [Example 1] the action information 'Solve' can be extracted on the basis of the natural language tokens or formula tokens. The data corresponding to the predicate in the deduplicated data is matched with action information and stored so that, when the combination data (a math problem) is defined as a schema, information on the representative action expressed by the whole sentence is obtained and can later serve as a tool for search or for analyzing the similarity between problems.
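Action matching can be pictured as a lookup from predicate tokens to predefined actions. The text only states that 'Solve' can be extracted for [Example 1]; the table `ACTION_TABLE`, its entries, and the rule "first matching token wins" are hypothetical choices made for this sketch.

```python
# Hypothetical rule table: predicate word -> predefined action label.
ACTION_TABLE = {
    "find": "Solve",
    "solve": "Solve",
    "prove": "Prove",
}


def match_action(tokens, table=ACTION_TABLE):
    """Return the action for the first token found in the table, else None."""
    for t in tokens:
        action = table.get(t.lower())
        if action is not None:
            return action
    return None


action = match_action(["Find", "function", "value", "with"])
print(action)  # Solve
```

Storing one representative action per problem keeps later similarity checks cheap, since two problems can first be compared on their action label before any token-level comparison.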

In summary, the natural language processing unit 130 generates natural language tokens by tokenizing the first information constituting the natural language. It then generates stop-word-filtered data by performing stop word filtering, which selects and removes the natural language tokens identified as preset stop words. Next, it generates deduplicated data by performing duplicate-removal filtering, which selects and removes duplicated data from the stop-word-filtered data. Finally, it matches the data corresponding to the predicate in the deduplicated data with action information to which a predefined meaning is assigned, and stores the result.

The formula processing unit 140 analyzes each piece of second information constituting the separated formula and classifies it according to a specific meaning. More concretely, the formula processing unit 140 analyzes the second information constituting the formula and then identifies the specific meaning using information on the type of the formula. That is, the formula processing unit 140 operates on the basis of preset rules to identify the specific meaning. The concrete way in which it analyzes the second information and classifies it according to a specific meaning is described with reference to FIG. 7.

The formula processing unit 140 converts the formula into a tree form, performs a traversal on the formula converted into tree form, and tokenizes the traversed formula. The formula processing unit 140 converts a formula written in MathML (Mathematical Markup Language) into an XML tree and then into a DOM (Document Object Model) form. It performs the traversal as a depth-first search in which the second information constituting the formula is passed from the lowest nodes progressively up to the higher nodes. To describe the traversal and the depth-first search more concretely: a formula is generally given in MathML, which is organized as a tree; walking this tree is called the traversal, and the traversal uses depth-first search. The traversal starts at the root of the tree and descends to the child nodes, and it moves back to the parent node only after all child nodes have been searched, so all of the information held by the child nodes is passed to the parent node. In terms of time complexity this is efficient, because the search only needs as many steps as there are edges.
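The traversal can be sketched with Python's standard `xml.dom.minidom`. This is an illustration under assumptions, not the patent's implementation: the polynomial x^3 + x^2 + x + 1 is a made-up input, chosen only because it has maximum degree 3 and four terms like the values quoted for [Example 1], and the summary fields (`tokens`, `max_degree`) and the term-counting rule are hypothetical.

```python
from xml.dom import minidom

# Hypothetical Presentation MathML for x^3 + x^2 + x + 1.
MATHML = (
    "<math><mrow>"
    "<msup><mi>x</mi><mn>3</mn></msup><mo>+</mo>"
    "<msup><mi>x</mi><mn>2</mn></msup><mo>+</mo>"
    "<mi>x</mi><mo>+</mo><mn>1</mn>"
    "</mrow></math>"
)


def traverse(node):
    """Depth-first, post-order walk: children are summarized before the parent."""
    children = [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]
    if not children:
        # Leaf element: its text becomes one formula token.
        text = "".join(
            c.data for c in node.childNodes if c.nodeType == c.TEXT_NODE
        ).strip()
        degree = 1 if node.tagName == "mi" else 0  # a bare identifier has degree 1
        return {"tokens": [(node.tagName, text)], "max_degree": degree}
    # Visit every child first, then pass their information up to this node.
    infos = [traverse(c) for c in children]
    merged = {
        "tokens": [tok for info in infos for tok in info["tokens"]],
        "max_degree": max(info["max_degree"] for info in infos),
    }
    if node.tagName == "msup" and len(infos) == 2:
        exponent = infos[1]["tokens"][0][1]
        if children[1].tagName == "mn" and exponent.isdigit():
            merged["max_degree"] = max(merged["max_degree"], int(exponent))
    return merged


root = minidom.parseString(MATHML).documentElement
info = traverse(root)
# Simplification: count terms as one more than the number of +/- operators.
number_of_terms = 1 + sum(
    1 for tag, text in info["tokens"] if tag == "mo" and text in ("+", "-")
)
print(info["max_degree"], number_of_terms)  # 3 4
```

Each element is entered once and left once, so the work is proportional to the number of edges in the tree, which matches the efficiency remark above.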

The data management unit 150 recombines at least one of the first information analyzed by the natural language processing unit 130, the second information analyzed by the formula processing unit 140, and the natural language and formulas recognized by the separation unit 120, and stores the result as recombined data. The data management unit 150 converts the recombined data into document data. The data management unit 150 may define an XML (eXtensible Markup Language) format so that the first information, the second information, the natural language, and the formulas are stored in a single XML tree; a concrete example is omitted in this embodiment. In outline, however, the defined XML is divided into two main parts: the first is a 'problem description' part, and the second is a 'semantic' part built from the information extracted from the natural language and the formulas. The 'semantic' part may be extended or changed in the future as new types of math problems are found.
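Since the text omits a concrete XML definition, the following is only a guess at what a two-part document could look like, built with Python's standard `xml.etree.ElementTree`. Every element and attribute name (`problem`, `description`, `text`, `formula`, `semantic`, `action`, `token`, `formulaToken`, `name`) is hypothetical, and the formula string is a made-up placeholder.

```python
import xml.etree.ElementTree as ET


def build_problem_xml(natural_nodes, formula_nodes, nl_tokens, action, formula_info):
    """Recombine separated and analyzed data into one XML tree (hypothetical schema)."""
    root = ET.Element("problem")

    # Part 1: 'problem description' holds the original natural language and formulas.
    description = ET.SubElement(root, "description")
    for text in natural_nodes:
        ET.SubElement(description, "text").text = text
    for formula in formula_nodes:
        ET.SubElement(description, "formula").text = formula

    # Part 2: 'semantic' gathers the information extracted by both processing units.
    semantic = ET.SubElement(root, "semantic")
    ET.SubElement(semantic, "action").text = action
    for token in nl_tokens:
        ET.SubElement(semantic, "token").text = token
    for name, value in formula_info.items():
        ET.SubElement(semantic, "formulaToken", name=name).text = str(value)
    return root


root = build_problem_xml(
    ["Find the function value", "with"],
    ["x^3 + x^2 + x + 1"],  # placeholder formula, not taken from the patent
    ["Find", "function", "value", "with"],
    "Solve",
    {"type": "Polynomial", "max_degree": 3, "number_of_terms": 4},
)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Collecting everything searchable under one `semantic` element means an indexer only has to read that subtree, while the `description` subtree preserves the original problem.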

As for the XML in which a math problem is defined, the math problem is organized as a tree, and the information needed for it is gathered from the whole tree into the semantic part, so that it can later be used for searching (indexing) math problems. In other words, with a math problem organized as a tree, mathematical content expressed in natural language and standardized formulas is converted into a form that the apparatus 100 can understand (recognize), and semantic information is extracted on the basis of the meaning of the natural language and the formulas and structured as an XML tree.

The natural language and formula structuring apparatus 100 may also store computing resources, such as hardware or software for structuring natural language and formulas, and provide the computing resources a client needs to the client's terminal via cloud computing. This is described in detail with reference to FIG. 6.

FIG. 2 is a block diagram schematically showing a natural language processing unit according to an embodiment of the present invention.

The natural language processing unit 130 according to an embodiment of the present invention includes a natural language tokenizing unit 210, a stop word filtering unit 220, a deduplication filtering unit 230, and an operation matching unit 240. Although this embodiment describes the natural language processing unit 130 as including only the natural language tokenizing unit 210, the stop word filtering unit 220, the deduplication filtering unit 230, and the operation matching unit 240, this merely illustrates the technical idea of the embodiment. A person of ordinary skill in the art to which the embodiment pertains may modify and vary the components included in the natural language processing unit 130 in various ways without departing from the essential characteristics of the embodiment.

The natural language tokenizing unit 210 generates natural language tokens by tokenizing the natural language. Specifically, it performs tokenization on the first information constituting the natural language. Here, a natural language token is each word obtained by splitting the natural language contained in the combination data (a mathematical problem) on spaces. For example, the natural language and mathematical formula structuring apparatus 100 may use the natural language tokenizing unit 210 to receive the natural language nodes contained in the combination data one at a time, or to receive all of the natural language nodes at once. A natural language node has the nature of a sentence made up of a plurality of words, but it is not limited to a complete sentence. Such a node is split into words, the unit that the natural language and mathematical formula structuring apparatus 100 can understand, and this process is called tokenization. A natural language node is defined as follows: when combination data (a mathematical problem) is organized into a schema, natural language and mathematical expressions appear intermixed in no fixed order, and each portion that is natural language is called a natural language node. A single problem (that is, a single schema) may therefore contain a plurality of natural language portions. [Example 1] contains two natural language nodes, 'Find the function value' and 'with'. Accordingly, when a problem is input to the system, tokenization splits each natural language node into units that the system can understand.
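The tokenization described above may be sketched in Python as follows. This is a minimal illustration that assumes the natural language nodes arrive as plain strings; the function name `tokenize_natural_language` is hypothetical, and the two input nodes are the ones the description quotes from [Example 1].

```python
def tokenize_natural_language(nodes):
    """Split each natural language node on whitespace and return all tokens in order."""
    tokens = []
    for node in nodes:
        # str.split() with no argument splits on runs of whitespace
        tokens.extend(node.split())
    return tokens


tokens = tokenize_natural_language(["Find the function value", "with"])
# tokens == ["Find", "the", "function", "value", "with"]
```

Because the function accepts a list, it covers both cases mentioned above: passing one node at a time or passing all nodes at once.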

The stop word filtering unit 220 generates stop word filtering data by filtering stop words out of the natural language tokens. Specifically, it performs stop word filtering, which selects and removes the natural language tokens identified as preset stop words. Here, stop words are a predefined set of words used to remove tokens that are unnecessary for analyzing a sentence or a mathematical expression. In [Example 1], 'the' (and likewise 'a', 'to', and so on) is predefined in the system in the form of a dictionary, where a dictionary means a list containing a set of words. That is, after the natural language tokens are generated, the natural language processing unit 130 removes the stop words, which are not needed for the analysis. Stop word filtering keeps too many tokens from entering the analysis when a mathematical problem is long (for example, a word problem), and it also improves the processing speed of the system. In other words, once tokenization has split each piece of first information constituting the natural language into a plurality of tokens, the natural language and mathematical formula structuring apparatus 100 uses the stop word filtering unit 220 to remove stop words as the next step. This step removes the tokens that are not needed for extracting semantic meaning. For example, 'this', 'that', 'here', and 'there' may be set as stop words, but stop words are not limited to these, and which tokens count as semantically unnecessary may be configured differently for each system.
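Stop word filtering of this kind may be sketched as follows. The dictionary contents are limited to the three words the description names ('the', 'a', 'to'); the names `STOP_WORDS` and `filter_stop_words`, and the choice to compare case-insensitively, are assumptions made for this sketch.

```python
# Predefined stop word dictionary; a real system would configure its own list.
STOP_WORDS = frozenset({"a", "the", "to"})


def filter_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, keeping their original order."""
    return [token for token in tokens if token.lower() not in stop_words]


filtered = filter_stop_words(["Find", "the", "function", "value", "with"])
# filtered == ["Find", "function", "value", "with"]
```

A `frozenset` gives constant-time membership checks, which matters for the long word problems the description mentions.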

The deduplication filtering unit 230 generates deduplication filtering data by performing deduplication filtering on the stop word filtering data. Specifically, it selects and removes the data that is duplicated within the stop word filtering data. That is, the natural language and mathematical formula structuring apparatus 100 uses the deduplication filtering unit 230 to remove duplicates after the stop words have been filtered out, and removing the duplicated words lowers the processing load of the natural language and mathematical formula structuring apparatus 100.
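A minimal sketch of deduplication filtering follows. The description does not say whether word order is preserved or whether case is significant; this sketch assumes that the first occurrence is kept in place and that comparison is case-insensitive. The function name `remove_duplicates` is hypothetical.

```python
def remove_duplicates(tokens):
    """Drop repeated tokens, keeping the first occurrence of each in its original position."""
    seen = set()
    unique = []
    for token in tokens:
        key = token.lower()  # assumption: "Value" and "value" count as the same word
        if key not in seen:
            seen.add(key)
            unique.append(token)
    return unique


deduplicated = remove_duplicates(["Find", "value", "function", "Value", "find"])
# deduplicated == ["Find", "value", "function"]
```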

The operation matching unit 240 matches the deduplication filtering data with operation information to which a predefined meaning has been assigned. Specifically, it matches the data corresponding to a predicate in the deduplication filtering data with such operation information and stores the match. Here, operation information is summary information that can be extracted from the natural language tokens or the mathematical expression tokens. For example, in [Example 1] the operation information 'Solve' can be extracted from the natural language tokens or the mathematical expression tokens. The predicate data is matched with operation information and stored so that, while the combination data (mathematical problem) is being defined as a schema, information on the representative operation expressed by the whole sentence is obtained and can later serve as an aid when searching or when analyzing the similarity between problems. The natural language and mathematical formula structuring apparatus 100 uses the operation matching unit 240 to analyze the characteristics of the combination data through pre-processing, compare the operations having predefined meanings against the tokens, and store the matches. That is, based on the results obtained by the natural language processing unit 130, the apparatus can use the operation matching unit 240 to group the mathematical expressions in the combination data as a 'condition', a 'definition', and so on, or to determine the semantic meaning of the mathematical content itself.
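Operation matching may be sketched as a dictionary lookup over the filtered tokens. The description states only that 'Solve' can be extracted for [Example 1]; the table `OPERATION_DICTIONARY`, its entries, and the function `match_operation` are hypothetical and stand in for whatever predefined operation meanings a given system holds.

```python
# Hypothetical table mapping predicate words to predefined operation information.
OPERATION_DICTIONARY = {
    "find": "Solve",
    "solve": "Solve",
    "explain": "Explain",
}


def match_operation(tokens, operations=OPERATION_DICTIONARY):
    """Return the operation of the first token that matches a predefined predicate, or None."""
    for token in tokens:
        operation = operations.get(token.lower())
        if operation is not None:
            return operation
    return None


operation = match_operation(["Find", "function", "value", "with"])
# operation == "Solve"
```

Returning a single representative operation per problem matches the stated purpose of the step, which is to summarize what the whole sentence asks for so that problems can later be searched or compared.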

FIG. 3 is a block diagram schematically showing a mathematical expression processing unit according to an embodiment of the present invention.

The mathematical expression processing unit 140 according to an embodiment of the present invention includes a tree transform unit 310, a semantic parser unit 320, and a mathematical expression tokenizing unit 330. Although this embodiment describes the mathematical expression processing unit 140 as including only the tree transform unit 310, the semantic parser unit 320, and the mathematical expression tokenizing unit 330, this merely illustrates the technical idea of the embodiment. A person of ordinary skill in the art to which the embodiment pertains may modify and vary the components included in the mathematical expression processing unit 140 in various ways without departing from the essential characteristics of the embodiment. Here, 'semantic' means enabling the apparatus to understand the meaning of specific information and to reason about it logically.

The natural language and mathematical formula structuring apparatus 100 receives, through the information input unit 110, individual mathematical expressions written in a standardized format and passes them to the mathematical expression processing unit 140. The expressions passed to the mathematical expression processing unit 140 take the form of XML tags based on MathML (Mathematical Markup Language), a standard defined by the W3C (World Wide Web Consortium). MathML is preferred for these expressions, but the format is not necessarily limited to it.

The tree transform unit 310 converts a mathematical expression into tree form. Specifically, it converts an expression written in MathML into an XML tree and then into DOM (Document Object Model) form. That is, the natural language and mathematical formula structuring apparatus 100 uses the tree transform unit 310 to convert the expression into a MathML XML tree, and this tree is then converted into a DOM, a tree form that a program can access.
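The conversion from MathML text to a program-accessible DOM may be sketched with Python's standard `xml.dom.minidom` module. The MathML string for the expression x + 2 and the function name `to_dom` are assumptions chosen for illustration; the source does not prescribe a particular DOM library.

```python
from xml.dom import minidom

# Hypothetical MathML input for the expression x + 2.
MATHML = (
    '<math xmlns="http://www.w3.org/1998/Math/MathML">'
    "<mrow><mi>x</mi><mo>+</mo><mn>2</mn></mrow>"
    "</math>"
)


def to_dom(mathml_source):
    """Parse MathML text into a DOM document that a program can walk."""
    return minidom.parseString(mathml_source)


document = to_dom(MATHML)
root = document.documentElement   # the <math> element
row = root.firstChild             # the <mrow> element
child_tags = [node.tagName for node in row.childNodes]
# child_tags == ["mi", "mo", "mn"]
```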

The semantic parser unit 320 traverses the mathematical expression that has been converted into tree form. It performs the traversal as a depth-first search in which the second information constituting the expression is passed from the lowest nodes progressively up to the higher nodes. The natural language and mathematical formula structuring apparatus 100 performs this traversal with the semantic parser unit 320 in order to determine the semantic meaning of the expression. As a result, all of the second information collected by the semantic parser unit 320 is gathered at the top node, and the tokens of the expression are created from this information. In more detail, an expression generally takes the form of MathML, which is organized as a tree; the process of walking this tree is called the traversal process, and it uses depth-first search. The traversal starts at the root of the tree, descends to the child nodes first, and moves back to the parent node once all of the child nodes have been searched, so all of the information held by the child nodes is passed to the parent node. In terms of time complexity this is efficient, because the search only needs to be performed as many times as there are edges.
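The bottom-up, depth-first gathering described above may be sketched as a recursive post-order walk over the DOM. The function name `gather`, the `(tag, text)` token shape, and the sample MathML are assumptions for this sketch; the source does not specify what form the gathered information takes.

```python
from xml.dom import minidom

# Hypothetical MathML input for the expression x + 2.
MATHML = (
    '<math xmlns="http://www.w3.org/1998/Math/MathML">'
    "<mrow><mi>x</mi><mo>+</mo><mn>2</mn></mrow>"
    "</math>"
)


def gather(node):
    """Depth-first walk: visit every child element first, then hand the results to the parent."""
    collected = []
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            collected.extend(gather(child))   # children report before the parent returns
    if not collected:
        # Leaf element such as <mi>, <mo>, or <mn>: contribute its own text.
        text = "".join(
            c.data for c in node.childNodes if c.nodeType == c.TEXT_NODE
        ).strip()
        if text:
            collected.append((node.tagName, text))
    return collected


root = minidom.parseString(MATHML).documentElement
formula_tokens = gather(root)
# formula_tokens == [("mi", "x"), ("mo", "+"), ("mn", "2")]
```

Each element is entered once through the edge from its parent, which is the basis for the edge-count cost noted above; the list returned at the root is the information from which formula tokens could then be formed.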

The mathematical expression tokenizing unit 330 generates mathematical expression tokens by tokenizing the expression on which the traversal has been performed. Here, a mathematical expression token is an individual unit of information obtained after parsing an expression contained in the combination data (mathematical problem). That is, the resulting tokens are tokens made up of mathematical language (Mathematics Natural Language). Mathematical expression tokens are handled differently from natural language tokens: the natural language processing unit 130 matches operations based on the natural language tokens, whereas in the mathematical expression processing unit 140 the mathematical expression tokens are themselves the output, and they can later be used for tasks such as finding mathematical content through search.

FIG. 4 is a flowchart illustrating a method for structuring natural language and mathematical expressions according to an embodiment of the present invention.

The natural language and mathematical formula structuring apparatus 100 receives combination data composed of a combination of natural language and mathematical expressions (S410). The combination data may be input directly by a user's operation or command, but it is not limited to this; document data composed of a combination of natural language and mathematical expressions may also be received from a separate external server. The natural language and mathematical formula structuring apparatus 100 then separates the natural language and the mathematical expressions from the combination data (S420). That is, when combination data composed of a combination of natural language and mathematical expressions is input, the apparatus separates and recognizes the natural language and the mathematical expressions contained in it.
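Step S420 may be sketched as follows, under the assumption that the combination data arrives as a single string in which each mathematical expression is an inline MathML `<math>` element. The source does not fix the input encoding, so this format, the pattern `MATH_PATTERN`, the function `separate`, and the sample input are all hypothetical.

```python
import re

# Capturing group so that re.split keeps the matched <math>...</math> spans.
MATH_PATTERN = re.compile(r"(<math\b.*?</math>)", re.DOTALL)


def separate(combined):
    """Split combination data into ("text", ...) and ("formula", ...) nodes, keeping word order."""
    nodes = []
    for piece in MATH_PATTERN.split(combined):
        piece = piece.strip()
        if not piece:
            continue
        kind = "formula" if piece.startswith("<math") else "text"
        nodes.append((kind, piece))
    return nodes


nodes = separate(
    "Find the function value <math><mi>y</mi></math> with <math><mi>x</mi></math>"
)
text_nodes = [content for kind, content in nodes if kind == "text"]
# text_nodes == ["Find the function value", "with"]
```

Keeping the nodes in one ordered list preserves the word order between natural language and mathematical expressions, so later steps can still tell which expression follows which phrase.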

The natural language and mathematical formula structuring apparatus 100 analyzes each piece of first information constituting the separated natural language and classifies it according to a specific meaning (S430). That is, the apparatus generates natural language tokens by tokenizing the natural language, generates stop word filtering data by filtering stop words out of the natural language tokens, generates deduplication filtering data by performing deduplication filtering on the stop word filtering data, and matches the deduplication filtering data with operation information to which a predefined meaning has been assigned. In more detail, the apparatus performs tokenization on the first information constituting the natural language to generate the natural language tokens. It performs stop word filtering, which selects and removes the natural language tokens identified as preset stop words, to generate the stop word filtering data. It performs deduplication filtering, which selects and removes the duplicated data in the stop word filtering data, to generate the deduplication filtering data. Finally, it matches the data corresponding to a predicate in the deduplication filtering data with operation information to which a predefined meaning has been assigned, and stores the match.

The natural language and mathematical formula structuring apparatus 100 analyzes each piece of second information constituting the separated mathematical expression and classifies it according to a specific meaning (S440). The apparatus converts the expression into tree form, traverses the expression converted into tree form, and tokenizes the traversed expression. In more detail, the apparatus converts an expression written in MathML into an XML tree and then into DOM form. It then performs the traversal as a depth-first search in which the second information constituting the expression is passed from the lowest nodes progressively up to the higher nodes.

The natural language and mathematical formula structuring apparatus 100 recombines at least one of the first information, the second information, the natural language, and the mathematical expression, and stores the result as recombination data (S450). The apparatus converts the recombined data into document data. That is, by performing steps S410 to S450, the natural language and the mathematical expressions can be stored and managed as recombined data through the natural language and mathematical formula structuring apparatus 100, and the stored recombination data can later be used to search for mathematical expressions or to extract the semantics of an expression.

Although FIG. 4 describes steps S410 to S450 as being executed sequentially, this merely illustrates the technical idea of an embodiment of the present invention. A person of ordinary skill in the art to which the embodiment pertains may, without departing from the essential characteristics of the embodiment, change the order described in FIG. 4 or execute one or more of steps S410 to S450 in parallel, so FIG. 4 is not limited to a time-series order.
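One possible parallel arrangement is to run steps S430 and S440 concurrently, since after separation they work on disjoint inputs. The following sketch uses Python's standard `concurrent.futures` module; the functions `process_natural_language` and `process_formulas` are simplified stand-ins introduced for this illustration, not the processing the embodiment defines.

```python
from concurrent.futures import ThreadPoolExecutor


def process_natural_language(text_nodes):
    """Stand-in for step S430: here it only tokenizes on whitespace."""
    return [word for node in text_nodes for word in node.split()]


def process_formulas(formula_nodes):
    """Stand-in for step S440: here it only trims surrounding whitespace."""
    return [formula.strip() for formula in formula_nodes]


def structure(text_nodes, formula_nodes):
    """Run the two independent analysis steps in parallel, then recombine the results (S450)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(process_natural_language, text_nodes)
        second = pool.submit(process_formulas, formula_nodes)
        return {
            "first_information": first.result(),
            "second_information": second.result(),
        }


recombined = structure(["Find the function value", "with"], [" f(x) ", " x = 2 "])
```

The formulas `f(x)` and `x = 2` are placeholders rather than content from the source. The recombination step waits on both results, so the output is the same as in the sequential order of FIG. 4.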

As described above, the method for structuring natural language and mathematical expressions according to an embodiment of the present invention shown in FIG. 4 can be implemented as a program and recorded on a computer-readable recording medium. The computer-readable recording medium on which the program for implementing the method is recorded includes every kind of recording device that stores data readable by a computer system. Examples of such computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, and they also include media implemented in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium may also be distributed over computer systems connected by a network, so that computer-readable code is stored and executed in a distributed manner. In addition, the functional programs, code, and code segments for implementing an embodiment of the present invention can be readily inferred by programmers in the technical field to which the embodiment pertains.

FIG. 5 is an exemplary diagram showing a tree-form representation of a mathematical expression according to an embodiment of the present invention.

As shown in FIG. 5, in the structure that a single piece of mathematical content can have, the child nodes connected to the root node are separated into natural language and mathematical expressions while preserving word order, which is one of the important carriers of meaning. In addition, each natural language node takes on a particular meaning depending on the order in which the sentence is connected. That is, most mathematical content is structured so that mathematical expressions are tied together around the natural language. For example, the expression following a piece of natural language may be attached as a specific condition, or the following expression may be a definition. Semantic meaning can be extracted not only from the meaning and connection relationships of the natural language in each node, but also by integrating the natural language as a whole. That is, to distinguish the operation that the mathematical content calls for, such as solving the expression or explaining it, all of the natural language is taken together to determine its meaning, which can be used to identify the direction of the problem.

FIG. 6 is an exemplary diagram of a system in which the natural language and mathematical formula structuring apparatus according to an embodiment of the present invention provides data by cloud computing.

For the natural language and mathematical formula structuring apparatus according to an embodiment of the present invention to provide data by cloud computing, a system including a terminal 610, a communication network 620, and the natural language and mathematical formula structuring apparatus 100 is required.

Here, the terminal 610 is a terminal capable of transmitting and receiving various data via the communication network 620 according to a user's command or operation, and it may be any one of a tablet PC, a laptop, a personal computer (PC), a smartphone, a personal digital assistant (PDA), a mobile communication terminal, and the like. The terminal 610 may also be a cloud computing terminal that supports cloud computing, through which services such as reading, writing, and storing data, networking, and content use are available over the communication network 620. That is, the terminal 610 is a device that includes a memory for storing a program for connecting to the natural language and mathematical formula structuring apparatus 100 via the communication network 620, and a microprocessor for executing the program to perform computation and control. Any terminal is acceptable as long as it is connected to the communication network 620 and is capable of server-client communication with the natural language and mathematical formula structuring apparatus 100; the term is a broad concept covering any communication-capable computing device, such as a notebook computer, a mobile communication terminal, or a PDA. The terminal 610 is preferably built with a touch screen, but it is not necessarily limited to this.

The terminal 610 can structure natural language and mathematical expressions in a cloud computing manner through the natural language and mathematical formula structuring apparatus 100. That is, in order to structure natural language and mathematical expressions by cloud computing, the terminal 610 may include a separate input/output interface unit that provides an input/output interface with the storage medium stored in the natural language and mathematical formula structuring apparatus 100, and an interface control unit that causes data to be read from and written to that storage medium through the input/output interface unit. More specifically, the terminal 610 can input combination data composed of a combination of natural language and mathematical expressions to the natural language and mathematical formula structuring apparatus 100 through the input/output interface unit. The apparatus then separates the natural language and the mathematical expressions from the combination data, analyzes each piece of first information constituting the separated natural language and classifies it according to a specific meaning, analyzes each piece of second information constituting the separated mathematical expression and classifies it according to a specific meaning, and generates and stores recombination data in which at least one of the first information, the second information, the natural language, and the mathematical expression is recombined. In effect, the terminal 610 can therefore structure natural language and mathematical expressions without installing any application.

The communication network 620 is a network capable of transmitting and receiving data over the Internet Protocol using various wired and wireless communication technologies, such as the Internet, an intranet, a mobile communication network, or a satellite communication network, and it relays data between the terminal 610 and the natural language and mathematical formula structuring apparatus 100. The communication network 620 may also include a cloud computing network that, in combination with the natural language and mathematical formula structuring apparatus 100, stores computing resources such as hardware and software and can provide the computing resources required by a client to the corresponding terminal 610.

So that natural language and mathematical expressions can be structured through the terminal 610 by cloud computing, the natural language and mathematical expression structuring apparatus 100 allows the terminal 610 to read data from and write data to the storage medium stored in the apparatus 100. The apparatus 100 stores a computer-readable recording medium that, when combination data composed of a combination of natural language and a mathematical expression is input, separates the natural language and the mathematical expression from the combination data, analyzes each piece of first information constituting the separated natural language and classifies it according to a specific meaning, analyzes each piece of second information constituting the separated mathematical expression and classifies it according to a specific meaning, and generates recombined data in which at least one of the first information, the second information, the natural language, and the mathematical expression is recombined. By transmitting only part of the data of that recording medium to the terminal 610, the apparatus 100 can provide cloud computing that lets the terminal 610 structure natural language and mathematical expressions without installing an application.
That is, the apparatus 100 may additionally include a storage unit that stores the storage medium used to structure natural language and mathematical expressions in a cloud computing manner, and a cloud computing unit that allows the terminal 610 to read data from and write data to the storage medium.
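The division of roles described above (a storage unit and a cloud computing unit on the apparatus side, and an input/output interface on the terminal side) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions: the class and method names are hypothetical, the network is replaced by a direct in-process call, and the `$...$` delimiter used to separate mathematical expressions from natural language is an assumption that does not appear in the specification.

```python
import re


class StructuringApparatus:
    """Hypothetical stand-in for apparatus 100."""

    def __init__(self):
        # Storage unit: maps a key to stored recombined data.
        self._storage = {}

    def _structure(self, combination_data):
        # Assumption: mathematical expressions are delimited by $...$.
        expressions = re.findall(r"\$(.+?)\$", combination_data)
        remainder = re.sub(r"\$.+?\$", " ", combination_data)
        natural_language = " ".join(remainder.split())
        return {"natural_language": natural_language, "expressions": expressions}

    # Cloud computing unit: read/write access offered to terminals.
    def write(self, key, combination_data):
        self._storage[key] = self._structure(combination_data)

    def read(self, key):
        return self._storage[key]


class Terminal:
    """Hypothetical stand-in for terminal 610; no structuring logic is installed here."""

    def __init__(self, apparatus):
        # Input/output interface to the storage held by the apparatus.
        self._apparatus = apparatus

    def submit(self, key, combination_data):
        self._apparatus.write(key, combination_data)

    def fetch(self, key):
        return self._apparatus.read(key)


terminal = Terminal(StructuringApparatus())
terminal.submit("p1", "Find the roots of $x^2 - 1 = 0$")
print(terminal.fetch("p1"))
```

The terminal only writes combination data and reads back the stored result, so all separation and analysis stays on the apparatus side, which is the point of the cloud arrangement described in the text.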

FIG. 7 is an exemplary diagram of a method of analyzing the information constituting natural language and mathematical expressions and classifying it according to a specific meaning, according to an embodiment of the present invention.

The operations that the natural language processing unit 130 and the mathematical expression processing unit 140 perform to determine a specific meaning are as follows. The natural language processing unit 130 and the mathematical expression processing unit 140 analyze each piece of constituent information making up the natural language and the mathematical expression, then determine the specific meaning using at least one of the structure of the sentence, the keywords it contains, and the type of the mathematical expression, and can generate semantic information classified by the determined meaning.

The natural language processing unit 130 and the mathematical expression processing unit 140 can determine the specific meaning by operating on the basis of preset rules. Specifically, when four mathematical sentences (P1, P2, P3, P4), each composed of a combination of natural language and a mathematical expression, are input through the information input unit 110 as shown in FIG. 7(A), the natural language processing unit 130 and the mathematical expression processing unit 140 can generate the analyzed (parsed) result of the first information constituting the natural language and the second information constituting the mathematical expression, as shown in FIG. 7(B).

For example, in the case of P1, the analysis of the first information constituting the natural language by the natural language processing unit 130 indicates that the name (Name) is "Find" and that its type is a verb (VB), and the analysis of the second information constituting the mathematical expression by the mathematical expression processing unit 140 indicates that equation (Equation) is true and polynomial (Polynomial) is true. Comparing this result with the logical conditions of the stored rules shown in FIG. 7(C) shows that, among rules R1, R2, and R3, P1 matches R1. Therefore, as shown in FIG. 7(D), "Solve", the operation information associated with the satisfied logical condition, can be extracted from the matched rule. In this case, the specific meaning expressed by P1 is recognized and classified as an operation index.
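The P1 example can be sketched in code as follows. This is a minimal illustration rather than the implementation of the embodiment: the dictionary layout, the `Rule` class, and the exact condition given to R1 are assumptions, since the text states only the parsed result for P1 and that P1 matches R1 and yields "Solve". Rules R2 and R3 are omitted because their conditions are not given in the text.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Parsed result for P1 as described in the text: the natural language token has
# name "Find" and type VB; the mathematical expression is an equation and a polynomial.
p1 = {
    "nl_tokens": [{"name": "Find", "type": "VB"}],
    "math": {"Equation": True, "Polynomial": True},
}


@dataclass
class Rule:
    rule_id: str
    condition: Callable[[dict], bool]  # logical condition over a parsed result
    action: str                        # operation information to extract


def has_verb(parsed: dict, name: str) -> bool:
    """True if the parsed result contains a verb token with the given name."""
    return any(t["name"] == name and t["type"] == "VB" for t in parsed["nl_tokens"])


# Assumed condition for R1 (hypothetical): verb "Find" plus an equation that is a polynomial.
R1 = Rule(
    "R1",
    lambda p: has_verb(p, "Find")
    and p["math"].get("Equation", False)
    and p["math"].get("Polynomial", False),
    "Solve",
)


def first_match(parsed: dict, rules: list) -> Optional[str]:
    """Return the operation information of the first rule whose condition holds."""
    for rule in rules:
        if rule.condition(parsed):
            return rule.action
    return None


print(first_match(p1, [R1]))  # prints "Solve"
```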

The natural language processing unit 130 or the mathematical expression processing unit 140 can extract all operation information that satisfies the logical conditions of the stored rules. The combination of natural language tokens and mathematical expression tokens may satisfy several logical conditions of the stored rules; in that case, a single math problem contains several pieces of operation information. If the combination of natural language tokens and mathematical expression tokens satisfies no logical condition, the compound sentence can be judged to be a sentence that was omitted from the analysis of mathematical sentences (combination data) when the rules were created, an item not covered by the analysis process, or an incorrect mathematical sentence. In addition, the natural language processing unit 130 or the mathematical expression processing unit 140 can match, from among the mathematical expression tokens, the mathematical expression that is the target of a natural language token produced by natural language parsing.
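The three outcomes described above (one match, several matches, no match) can be illustrated with a short sketch. The feature-set representation, the rule identifiers RA and RB, and the "Factor" operation are hypothetical and chosen only to show the control flow; they are not taken from FIG. 7.

```python
# Hypothetical rules: each maps a set of required features to operation information.
RULES = {
    "RA": ({"Find/VB", "Equation"}, "Solve"),
    "RB": ({"Find/VB", "Polynomial"}, "Factor"),
}


def extract_operations(features, rules=RULES):
    """Return the operation information of every rule whose required features are all present."""
    return [action for required, action in rules.values() if required <= features]


def analyze(features):
    """Classify a parsed sentence by how many rules it satisfies."""
    operations = extract_operations(features)
    if not operations:
        # No rule applies: the sentence was not covered when the rules were
        # written, or it is an incorrect mathematical sentence.
        status = "unmatched"
    elif len(operations) == 1:
        status = "single"
    else:
        # One problem carries several pieces of operation information.
        status = "multiple"
    return status, operations


print(analyze({"Find/VB", "Equation", "Polynomial"}))
print(analyze({"Equation"}))
```

Returning every matching operation rather than stopping at the first keeps problems that ask for more than one operation intact, and the "unmatched" status gives a hook for reviewing rule coverage.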

The foregoing description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from its essential characteristics. Accordingly, the embodiments disclosed herein are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent to them should be construed as falling within the scope of the present invention.

As described above, the present invention is applicable to various fields in which natural language and mathematical expressions are managed as recombined data, based on an analysis in which natural language processing and mathematical expression processing are performed together on data combining natural language and mathematical expressions. It is a useful invention in that standard documents combining natural language and mathematical expressions can later be used as keywords when searching mathematical content.

110: Information input unit    120: Separation unit
130: Natural language processing unit    140: Mathematical expression processing unit
150: Data management unit    210: Natural language tokenizing unit
220: Stop word filtering unit    230: De-duplication filtering unit
240: Operation matching unit    310: Tree conversion unit
320: Semantic parser unit    330: Mathematical expression tokenizing unit
610: Terminal    620: Communication network

Claims

1. A natural language and mathematical expression structuring apparatus comprising:
an information input unit for receiving combination data composed of a combination of a natural language and a mathematical expression;
a separation unit for separating the natural language and the mathematical expression from the combination data;
a natural language processing unit for analyzing the natural language and generating natural language processing information classified according to a specific meaning;
a mathematical expression processing unit for analyzing the mathematical expression and generating mathematical expression processing information classified according to a specific meaning; and
a data management unit for recombining the natural language processing information and some information included in the mathematical expression processing information and storing the result as recombined data.

2. The apparatus of claim 1, wherein the natural language processing unit comprises:
a natural language tokenizing unit for generating a natural language token by tokenizing the natural language;
a stop word filtering unit for generating stop word filtering data in which stop words are filtered out based on the natural language token;
a de-duplication filtering unit for generating de-duplication filtering data by performing de-duplication filtering on the stop word filtering data; and
an operation matching unit for matching operation information, to which a predefined meaning has been assigned, to the de-duplication filtering data.

3. The apparatus of claim 2, wherein the natural language tokenizing unit generates the natural language token by performing tokenization on the natural language.

4. The apparatus of claim 2, wherein the stop word filtering unit generates the stop word filtering data by performing stop word filtering that selects and removes, from the natural language token, any natural language token identified as a predetermined stop word.

5. The apparatus of claim 2, wherein the de-duplication filtering unit generates the de-duplication filtering data by performing de-duplication filtering that selects and removes duplicated data from the stop word filtering data.

6. The apparatus of claim 2, wherein the operation matching unit matches the data corresponding to a predicate in the de-duplication filtering data with the operation information to which the predefined meaning has been assigned.

7. The apparatus of claim 1, wherein the mathematical expression processing unit comprises:
a tree conversion unit for converting the mathematical expression into a tree form;
a semantic parser unit for performing a traversal process on the mathematical expression converted into the tree form; and
a mathematical expression tokenizing unit for generating a mathematical expression token by performing tokenization on the mathematical expression on which the traversal process has been performed.

8. The apparatus of claim 7, wherein the tree conversion unit converts the mathematical expression written in MathML (Mathematical Markup Language) into an XML tree form and then converts it into a DOM (Document Object Model) form.

9. The apparatus of claim 7, wherein the semantic parser unit performs the traversal process by a depth-first search method that moves progressively from the uppermost node of the mathematical expression toward the lowermost node.

10. The apparatus of claim 1, wherein the data management unit converts the recombined data into document data.

11. A natural language and mathematical expression structuring apparatus comprising:
an information input unit for receiving combination data composed of a combination of a natural language and a mathematical expression;
a separation unit for separating the natural language and the mathematical expression from the combination data;
a natural language processing unit for generating a natural language token by tokenizing the natural language, generating stop word filtering data in which stop words are filtered out based on the natural language token, generating de-duplication filtering data by performing de-duplication filtering on the stop word filtering data, and generating natural language processing information by matching operation information, to which a predefined meaning has been assigned, to the de-duplication filtering data;
a mathematical expression processing unit for analyzing the mathematical expression and generating mathematical expression processing information classified according to a specific meaning; and
a data management unit for recombining the natural language processing information and some information included in the mathematical expression processing information and storing the result as recombined data.

12. A natural language and mathematical expression structuring apparatus comprising:
a storage unit for storing a program that, when combination data composed of a combination of a natural language and a mathematical expression is input, separates the natural language and the mathematical expression from the combination data, generates natural language processing information classified according to a specific meaning by analyzing the natural language, generates mathematical expression processing information classified according to a specific meaning by analyzing the mathematical expression, and generates recombined data in which the natural language processing information and some information included in the mathematical expression processing information are recombined; and
a cloud computing unit for allowing a terminal to read and write data on the storage medium.

13. A natural language and mathematical expression structuring apparatus comprising:
an information input unit for receiving combination data composed of a combination of a natural language and a mathematical expression;
a separation unit for separating the natural language and the mathematical expression from the combination data;
a natural language processing unit for analyzing the natural language and generating natural language processing information classified according to a specific meaning;
a mathematical expression processing unit for converting the mathematical expression into a tree form, performing a traversal process on the mathematical expression converted into the tree form, and generating a mathematical expression token by performing tokenization on the mathematical expression on which the traversal process has been performed; and
a data management unit for recombining the natural language processing information and some information included in the mathematical expression token and storing the result as recombined data.

14. A natural language and mathematical expression structuring method comprising:
an information input step of receiving combination data composed of a combination of a natural language and a mathematical expression;
a separation step of separating the natural language and the mathematical expression from the combination data;
a natural language processing step of analyzing the natural language and generating natural language processing information classified according to a specific meaning;
a mathematical expression processing step of analyzing the mathematical expression and generating mathematical expression processing information classified according to a specific meaning; and
a data management step of recombining the natural language processing information and some information included in the mathematical expression processing information and storing the result as recombined data.

15. The method of claim 14, wherein the natural language processing step comprises:
a natural language tokenization step of generating a natural language token by tokenizing the natural language;
a stop word filtering step of generating stop word filtering data in which stop words are filtered out based on the natural language token;
a de-duplication filtering step of generating de-duplication filtering data by performing de-duplication filtering on the stop word filtering data; and
an operation matching step of matching operation information, to which a predefined meaning has been assigned, to the de-duplication filtering data.

16. The method of claim 14, wherein the mathematical expression processing step comprises:
a tree conversion step of converting the mathematical expression into a tree form;
a semantic parsing step of performing a traversal process on the mathematical expression converted into the tree form; and
a mathematical expression tokenization step of performing tokenization on the mathematical expression on which the traversal process has been performed.

17. A computer-readable recording medium having recorded thereon a program for executing each step of the natural language and mathematical expression structuring method according to any one of claims 14 to 16.