KR20160131730A

KR20160131730A - System, Apparatus and Method For Processing Natural Language, and Computer Readable Recording Medium

Info

Publication number: KR20160131730A
Application number: KR1020150064724A
Authority: KR
Inventors: 박은상; 김경덕; 류성한; 이근배
Original assignee: 삼성전자주식회사
Priority date: 2015-05-08
Filing date: 2015-05-08
Publication date: 2016-11-16

Abstract

The present invention relates to a natural language processing system, apparatus, and method, and to a computer-readable recording medium, capable of splitting an input into simple sentences based on the boundaries between sentences. According to an embodiment of the present invention, the natural language processing apparatus includes a storage unit that stores corpus information containing the named-entity category to which each named-entity word in a sentence belongs and an identifier that marks the boundary between sentences; a communication interface unit that receives a compound or complex sentence input to a user device; and a natural language processing unit that uses the identifier in the corpus information to generate a plurality of simple sentences corresponding to the compound or complex sentence.

Description

The present invention relates to a natural language processing system, a natural language processing apparatus, a natural language processing method, and a computer-readable recording medium. More particularly, it relates to a system, apparatus, method, and recording medium that, when a compound sentence is converted into simple sentences in a natural language processing system such as a spoken dialogue system, a question answering system, or a chit-chat system, split the input into simple sentences based on sentence boundaries determined by applying the categories of the named-entity words in the input compound sentence to, for example, a statistical translation technique.

Machine translation generally means that a computer system automatically converts a natural language sentence F in the source language into a natural language sentence E in the target language. Among machine translation techniques, statistical machine translation learns a translation model from training data and translates using the learned model. More specifically, given F, it is the process of finding the E that maximizes the probability Pr(E | F); that is, E is the best translation of F. This can be expressed as Equation 1.

Applying Bayes' rule to Equation 1 to decompose Pr(E | F) yields Equation 2.
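The equation images for Equations 1 and 2 are not reproduced in this text. Based on the surrounding description, they presumably take the standard form below; the hat notation $\hat{E}$ for the selected translation is an assumption introduced here.

```latex
\begin{align}
\hat{E} &= \arg\max_{E} \Pr(E \mid F) \tag{1} \\
\hat{E} &= \arg\max_{E} \frac{\Pr(F \mid E)\,\Pr(E)}{\Pr(F)}
         = \arg\max_{E} \Pr(F \mid E)\,\Pr(E) \tag{2}
\end{align}
```

The denominator Pr(F) can be dropped in Equation 2 because it does not depend on E and so does not change which E attains the maximum.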

Here, Pr(F | E) is the translation model: the probability that F is produced as the translation when E is given, which indicates how appropriate it is to translate E into F. The translation model is trained on bilingual training data.

Pr(E) is the language model: the probability that E occurs in the language, which indicates how natural E is. The language model is trained on monolingual training data.
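As a minimal sketch of how the translation model and the language model combine, the following picks, from a fixed candidate list, the E that maximizes Pr(F | E) · Pr(E). The probability tables, the candidate strings, and the function name `best_translation` are hypothetical values chosen for illustration; a real system would estimate the tables from bilingual and monolingual data and search a far larger space.

```python
import math

# Hypothetical toy probabilities, not taken from the patent.
TRANSLATION_MODEL = {  # Pr(F | E), learned from bilingual data
    ("녹화해줘", "record it"): 0.7,
    ("녹화해줘", "it record"): 0.7,
    ("녹화해줘", "film it"): 0.2,
}
LANGUAGE_MODEL = {  # Pr(E), learned from monolingual data
    "record it": 0.05,
    "it record": 0.001,
    "film it": 0.02,
}


def best_translation(f, candidates):
    """Return the candidate E that maximizes log Pr(F|E) + log Pr(E)."""
    def score(e):
        p_f_given_e = TRANSLATION_MODEL.get((f, e), 1e-9)  # small floor for unseen pairs
        p_e = LANGUAGE_MODEL.get(e, 1e-9)
        return math.log(p_f_given_e) + math.log(p_e)
    return max(candidates, key=score)
```

In this toy setup "record it" and "it record" are equally good under the translation model, and the language model breaks the tie in favor of the more natural word order.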

A conventional natural language processing system analyzes the morphological information, syntactic structure, meaning, and so on of an input sentence. An input sentence is either a single basic sentence of minimal size or a sentence composed of several basic sentences, that is, a composite sentence.

The basic sentences that make up a composite sentence can be connected to one another in various ways.

For example, consider a natural language processing system that recognizes and executes voice commands related to TV programs.

A TV user may utter to the natural language processing system the composite sentence "Record OCN news and show me Family Guy", in which the basic sentences "Record OCN news" and "Show me Family Guy" are joined by the conjunction "and".

In some languages, such as Korean, sentences may also change form when they are joined by a conjunction. A TV user may utter the composite sentence "무한도전 녹화하고 1박2일 틀어줘" ("Record Infinite Challenge and play 1 Night 2 Days"), in which the basic sentences "무한도전 녹화해줘" ("Record Infinite Challenge") and "1박2일 틀어줘" ("Play 1 Night 2 Days") are joined by the connective "고" ("and").

A TV user may also utter the composite sentence "Record OCN news show me Family Guy", which arises when the user speaks two sentences in succession without a conjunction.

Because such composite sentences are difficult for a conventional natural language processing system to handle, the performance of the system degrades.

An object of embodiments of the present invention is to provide a natural language processing system, a natural language processing apparatus, a natural language processing method, and a computer-readable recording medium that, when a compound sentence is converted into simple sentences in a natural language processing system such as a spoken dialogue system, a question answering system, or a chit-chat system, split the input into simple sentences based on sentence boundaries determined by applying the categories of the named-entity words in the input compound sentence to, for example, a statistical translation technique.

A natural language processing system according to an embodiment of the present invention includes a user device that receives a compound or complex sentence as input, and a natural language processing apparatus that, when the compound or complex sentence is input, generates a plurality of simple sentences corresponding to it using an identifier in pre-stored corpus information and provides the generated simple sentences to the user device. The corpus information includes the named-entity category to which each named-entity word in a sentence belongs and an identifier that marks sentence boundaries.

A natural language processing apparatus according to an embodiment of the present invention includes a storage unit that stores corpus information containing the named-entity category to which each named-entity word in a sentence belongs and an identifier that marks the boundary between sentences; a communication interface unit that receives a compound or complex sentence input to a user device; and a natural language processing unit that, when the compound or complex sentence is input, generates a plurality of simple sentences corresponding to it using the identifier in the stored corpus information and provides the generated simple sentences to the user device.

The natural language processing unit may determine the category of a named-entity word in the received compound or complex sentence, extract from the storage unit the corpus information that contains the determined named-entity category, and generate the plurality of simple sentences based on the extracted corpus information.

The natural language processing unit may include a named-entity recognition execution unit that determines the category of a named-entity word in the compound or complex sentence, and a statistical translation execution unit that replaces the named-entity word in the compound sentence with the determined named-entity category, obtains the corpus information related to that category, and generates the plurality of simple sentences based on the obtained corpus information.

The statistical translation execution unit may generate the plurality of simple sentences by splitting the compound or complex sentence based on the identifier and restoring the split parts into the plurality of simple sentences.

The natural language processing unit may generate the plurality of simple sentences in the same language as the compound or complex sentence and provide them to the user device.

The natural language processing unit may translate the plurality of simple sentences into a language different from that of the compound or complex sentence, generate a translated sentence that includes a conjunction, and provide it to the user device.

The natural language processing unit may treat the input as a compound sentence when a plurality of simple sentences, joined without a conjunction, are provided in succession at a set time interval.
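One way to read this rule is that utterances separated by no more than the set time interval are merged into a single compound-sentence input. The sketch below illustrates that reading only; the `Utterance` type, the function name `group_by_gap`, and the 1.5-second default are assumptions, since the text specifies only "a set time interval".

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    text: str
    start: float  # seconds since the session began
    end: float


def group_by_gap(utterances, max_gap=1.5):
    """Merge consecutive utterances whose silence gap is at most max_gap seconds.

    Each merged group is returned as one string and would be handled
    downstream as a single compound-sentence input.
    """
    groups = []
    for u in sorted(utterances, key=lambda x: x.start):
        if groups and u.start - groups[-1][-1].end <= max_gap:
            groups[-1].append(u)   # close enough: same compound input
        else:
            groups.append([u])     # long pause: start a new input
    return [" ".join(u.text for u in g) for g in groups]
```

For instance, "Record OCN news" followed 0.6 seconds later by "show me Family Guy" would be merged, while a command spoken several seconds afterwards would start a new input.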

When a simple sentence or a quasi-simple sentence is input at the user device, the natural language processing unit may generate a simple sentence based on the corpus information related to the named-entity category to which the named-entity word in the input sentence belongs, and provide it to the user device.

The natural language processing unit may obtain the same corpus information for a first compound sentence and a second compound sentence that contain different named-entity words, provided their named-entity categories match.
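The effect described here can be sketched as a lookup keyed on the category-tagged form of the sentence rather than on its surface words. Everything in the block is hypothetical: the gazetteer, the single corpus entry, the `@category` notation applied to English text, and the function names are illustrative stand-ins, and a plain dictionary lookup replaces whatever statistical matching the actual system performs.

```python
# Hypothetical gazetteer mapping named-entity words to categories.
GAZETTEER = {
    "Harry Potter": "movie",
    "Gone with the Wind": "movie",
    "KBS": "channel",
}

# Hypothetical parallel corpus: tagged compound sentence -> tagged output
# in which '#' marks the sentence boundary.
PARALLEL_CORPUS = {
    "tell me who stars in @movie and turn on @channel":
        "tell me who stars in @movie # turn on @channel",
}


def to_template(sentence):
    """Replace each known entity name with '@<category>' (longest names first)."""
    for name in sorted(GAZETTEER, key=len, reverse=True):
        sentence = sentence.replace(name, "@" + GAZETTEER[name])
    return sentence


def lookup_corpus(sentence):
    """Return the corpus entry for the sentence's category template, or None."""
    return PARALLEL_CORPUS.get(to_template(sentence))
```

Because both movie titles collapse to `@movie`, the two example sentences share one template and therefore retrieve the same corpus entry, which keeps the corpus small relative to the number of possible entity names.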

The identifier may include a symbol or bit information.
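To make the two options concrete, the sketch below converts between a symbol-based boundary marker and an equivalent per-token bit vector. The choice of `#` as the symbol, the convention that a 1 bit means "a boundary follows this token", and both function names are assumptions made for illustration; the text does not specify an encoding.

```python
BOUNDARY = "#"  # assumed boundary symbol


def symbols_to_bits(tagged_tokens):
    """Turn a token list containing boundary symbols into (tokens, bits).

    bits[i] == 1 means a sentence boundary follows tokens[i].
    """
    tokens, bits = [], []
    for tok in tagged_tokens:
        if tok == BOUNDARY:
            if bits:
                bits[-1] = 1  # mark the boundary on the preceding token
        else:
            tokens.append(tok)
            bits.append(0)
    return tokens, bits


def bits_to_sentences(tokens, bits):
    """Rebuild the individual sentences from tokens and boundary bits."""
    sentences, current = [], []
    for tok, bit in zip(tokens, bits):
        current.append(tok)
        if bit:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

Both representations carry the same information; the bit form simply keeps the token sequence free of extra symbols.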

Furthermore, a natural language processing method according to an embodiment of the present invention includes: storing corpus information containing the named-entity category to which each named-entity word in a sentence belongs and an identifier that marks the boundary between sentences; receiving a compound or complex sentence input to a user device; and, when the compound or complex sentence is input, generating a plurality of simple sentences corresponding to it using the identifier in the stored corpus information and providing the generated simple sentences to the user device.

The step of providing to the user device may include: determining the category of a named-entity word in the received compound or complex sentence; extracting from the storage unit the corpus information that contains the determined named-entity category; and generating the plurality of simple sentences based on the extracted corpus information.

The step of providing to the user device may include: determining the category of a named-entity word in the compound or complex sentence; replacing the named-entity word in the compound or complex sentence with the determined named-entity category; obtaining the corpus information related to that category; and generating the plurality of simple sentences based on the obtained corpus information.

The step of generating the plurality of simple sentences may include splitting the compound or complex sentence based on the identifier, and restoring the split parts into the plurality of simple sentences.

The step of providing to the user device may generate the plurality of simple sentences in the same language as the compound or complex sentence and provide them to the user device.

The step of providing to the user device may include treating the input as a compound sentence when a plurality of simple sentences, joined without a conjunction, are provided in succession at a set time interval.

The step of providing to the user device may include obtaining the same corpus information for a first compound sentence and a second compound sentence that contain different named-entity words, provided their named-entity categories match.

Meanwhile, a computer-readable recording medium according to an embodiment of the present invention contains a program for executing a natural language processing method, the method including: storing corpus information containing the named-entity category to which each named-entity word in a sentence belongs and an identifier that marks the boundary between sentences; receiving a compound or complex sentence input to a user device; and, when the compound or complex sentence is input, generating a plurality of simple sentences corresponding to it using the identifier in the stored corpus information and providing the generated simple sentences to the user device.

FIG. 1 is a diagram showing a natural language processing system according to an embodiment of the present invention.
FIG. 2 is a diagram showing the detailed structure of the natural language processing apparatus of FIG. 1.
FIG. 3 is a diagram showing another detailed structure of the natural language processing apparatus of FIG. 1.
FIG. 4 is a diagram showing an example of the basic and composite sentence parallel corpus information of FIG. 3.
FIG. 5 is a diagram showing examples of how an input sentence is transformed when processed by the natural language processing unit of FIG. 2 or the natural language processing module of FIG. 3.
FIG. 6 is a diagram showing a natural language processing procedure according to an embodiment of the present invention.
FIG. 7 is a flowchart showing a natural language processing method according to a first embodiment of the present invention.
FIG. 8 is a flowchart showing a natural language processing method according to a second embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a diagram showing a natural language processing system according to an embodiment of the present invention.

As shown in FIG. 1, the natural language processing system 90 according to an embodiment of the present invention may include some or all of a user device 100, a communication network 110, and a natural language processing apparatus 120.

Here, "including some or all" means that at least one of the communication network 110 and the natural language processing apparatus 120 may be omitted, so that the user device 100 performs the natural language processing operation stand-alone or in cooperation with a network device within the communication network 110; it also means that a component such as the communication network 110 may be omitted, so that the user device 100 and the natural language processing apparatus 120 communicate directly (e.g., P2P). To aid a full understanding of the invention, the description below assumes that all components are included.

The user device 100 may include a display device such as a DTV, smartphone, desktop computer, laptop computer, tablet PC, or wearable device capable of, for example, search, spoken dialogue, question answering, and chit-chat functions. Any device that supports these functions may be used, even if it is not a display device. Taking question answering as an example, the user device 100 receives a text or voice query, through a search box or a microphone, from a user requesting an answer, and passes the received query to the natural language processing apparatus 120 via the communication network 110. The user device 100 may provide a text-based recognition result to the natural language processing apparatus 120. For example, when the query is spoken, the user device 100 may receive the voice query through a voice receiver such as a microphone, recognize it using an utterance engine (that is, a program) such as *-Voice, and output the recognition result as text.

However, because the natural language processing apparatus 120 can have a more powerful engine (program) than the user device 100, it is preferable for the text-based recognition result to be generated by the natural language processing apparatus 120. In that case the user device 100 forwards only the voice signal received through the microphone, and the natural language processing apparatus 120 performs speech recognition on the received signal and generates the text-based recognition result. Embodiments of the present invention therefore place no particular limitation on how the recognition result is produced.

According to an embodiment of the present invention, the user device 100 may receive queries of various forms from the user. Broadly, this means words and sentences; more precisely, it covers receiving a single word, receiving several words, and receiving a sentence. In embodiments of the present invention, the sentence case is preferred. Here, a sentence may take the form of a simple, compound, or complex sentence, or a combination of these, and "form" covers not only compound sentences proper but also other cases that resemble them. Whereas a complex sentence is one in which two clauses stand in a dependent relationship, for example through a relative pronoun, a composite sentence in embodiments of the present invention means a compound sentence or a sentence of compound form. For example, 'Record OCN news' and 'Show me Family Guy' are each simple sentences. When such simple sentences are joined by a conjunction, such as a coordinating conjunction, they form a compound sentence. In other words, 'Record OCN news and show me Family Guy' is a compound sentence.

These conjunctions can take various forms, as in 'Record OCN news, then show me Family Guy' or 'After recording OCN news, show me Family Guy'. Furthermore, two simple sentences may be provided in succession, separated by a certain time interval, with no conjunction at all, as in 'Record OCN news show me Family Guy'. According to an embodiment of the present invention, the user device 100 may receive these various forms of compound sentence as text or voice.

Whether the sentence provided by the user is simple or compound, the user device 100 may receive the response from the natural language processing apparatus 120 as a plurality of sentences in their original form, for example simple sentences, and thereby obtain the user's commands. For example, for 'Record OCN news and show me Family Guy' above, it receives the two simple sentences 'Record OCN news' and 'Show me Family Guy'. The simple sentences are preferably in the same language as the input, but they may also be provided in a different language, which can be useful in chat between users who speak different languages. Because the user device 100 receives the user's compound sentence back from the natural language processing apparatus 120 as simple sentences that it can recognize, it can readily carry out the operations the user asked for.

The communication network 110 includes both wired and wireless communication networks. The wired network includes an Internet network such as a cable network or the public switched telephone network (PSTN), and the wireless network includes CDMA, WCDMA, GSM, EPC (Evolved Packet Core), LTE (Long Term Evolution), WiBro, and the like. The communication network 110 according to embodiments of the present invention is not limited to these; as the access network of a next-generation mobile communication system to be implemented in the future, it may be used, for example, in a cloud computing network under a cloud computing environment. If the communication network 110 is a wired network, an access point within it can connect to, for example, a telephone company's exchange office. If it is a wireless network, the access point can process data by connecting to an SGSN (Serving GPRS Support Node) or GGSN (Gateway GPRS Support Node) operated by the carrier, or by connecting to various relays such as a BTS (Base Transceiver Station), NodeB, or e-NodeB.

The communication network 110 may include an access point. The access point includes a small base station, such as a femto or pico base station, of the kind often installed inside buildings. Femto and pico base stations are distinguished, within the classification of small base stations, by the maximum number of user devices 100 that can connect to them. The access point also includes a short-range communication module for short-range communication with the user device 100, such as Zigbee and Wi-Fi. The access point may use TCP/IP or RTSP (Real-Time Streaming Protocol) for wireless communication. Besides Wi-Fi, short-range communication may be carried out under various standards such as Bluetooth, Zigbee, infrared (IrDA), RF (Radio Frequency) such as UHF (Ultra High Frequency) and VHF (Very High Frequency), and ultra-wideband (UWB). Accordingly, the access point can extract the destination of a data packet, determine the best communication path to that destination, and forward the data packet along the determined path to the next device, for example the user device 100. The access point can share several lines in a typical network environment and may include, for example, a router, a repeater, and a relay.

The natural language processing apparatus 120 quickly predicts the boundary between two sentences in an input sentence received from the user device 100 in simple or compound form, more precisely in a sentence of compound form, and uses a simplified machine language to do so. Here, the simplified machine language can be regarded as a language used internally to convert the input compound sentence quickly into split simple sentences. As described again later, suppose for example that the natural language processing apparatus 120 receives the compound sentences "Tell me who stars in Harry Potter and turn on KBS" and "Tell me who stars in 'Gone with the Wind' and turn on KBS". If training has determined that 'Harry Potter' and 'Gone with the Wind' both belong to the named-entity category (also called named-entity attribute or named-entity type) movie, the apparatus converts (or substitutes, or translates) both into the same form, '알려줘 @movie 주연,' (literally 'tell-me @movie lead,'), and applies the statistical translation technique on that basis to quickly split the input and generate a plurality of simple sentences. Here, a named-entity category is the category to which a named-entity word in a sentence belongs. For example, named entities, that is, the names the words denote, may be 'Harry Potter' or 'Gone with the Wind' as movie titles, 'OCN' as a channel name, or '*바마' as a person, and the categories to which they belong may be movie, channel name, person, and so on. In a different sentence, however, the category of the named entity 'Harry Potter' may be person.
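The substitution step can be sketched as follows: each known named entity is replaced by an `@category` tag, and the original words are remembered so that they can be put back later. The gazetteer contents and the function name `tag_entities` are hypothetical, and a fixed dictionary stands in for the trained recognizer the text describes; in particular, this sketch cannot handle a name such as 'Harry Potter' whose category depends on context.

```python
def tag_entities(sentence, gazetteer):
    """Replace each known entity name with '@<category>'.

    Returns the tagged sentence and the list of replaced names in
    left-to-right order, so they can be restored afterwards.
    """
    names = sorted(gazetteer, key=len, reverse=True)  # prefer the longest match
    out, slots, i = [], [], 0
    while i < len(sentence):
        for name in names:
            if sentence.startswith(name, i):
                out.append("@" + gazetteer[name])
                slots.append(name)
                i += len(name)
                break
        else:  # no entity starts at position i: copy one character
            out.append(sentence[i])
            i += 1
    return "".join(out), slots
```

With a gazetteer that lists both titles as movies, the two example sentences above produce the same tagged string and differ only in their remembered slots.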

Here, the statistical (machine) translation technique takes a compound sentence in any of various forms, or more precisely a compound sentence tagged with entity name categories, denoted A, and outputs the sentence B that it has been trained (or has learned) to produce for A. In other words, processing sentence A yields a sentence B in which several short sentences are connected by a specific identifier. For example, sentence B may take a form such as '@ movie record # @ movie show' (Korean word order, with the object before the verb), in which a sentence separator is inserted as the identifier between two short sentences. Based on this sentence separator, the natural language processing apparatus 120 predicts the boundary between the two short sentences and, on that basis, generates the two short sentences 'Record Harry Potter' and 'Show Gone with the Wind' and provides them to the user device 100, or converts them into another language and provides them to the user device 100. Alternatively, the two short sentences may be translated into another language, rejoined with a conjunction, and then provided to the user device 100. Although the identifier is described here as a symbol, it may also take the form of bit information, so it is not particularly limited. In addition, considering that the translation output may be ungrammatical or may have a changed word order, the above '@ movie record # @ movie show' may instead take the form 'record @ movie # show @ movie'. Therefore, embodiments of the present invention do not particularly limit the form of this output.
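The boundary step described above can be sketched in Python as follows. This is a minimal illustration, not the claimed implementation: the function name `split_on_identifier` and the choice of '#' as the default identifier are assumptions, and the identifier could equally be bit information.

```python
def split_on_identifier(tagged_output: str, identifier: str = "#") -> list[str]:
    """Split a tagged output sentence into short sentences at each identifier.

    `identifier` is assumed here to be a plain-text sentence separator.
    """
    parts = tagged_output.split(identifier)
    # Normalize whitespace inside each piece and drop empty pieces.
    return [" ".join(part.split()) for part in parts if part.strip()]


print(split_on_identifier("@movie record # @movie show"))
```

Because the split depends only on the identifier, the same function handles the reordered form 'record @movie # show @movie' without change.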

For example, when a user provides a sentence as text through a search window of the user device 100, the natural language processing apparatus 120 can use that text sentence as it is. When the user speaks and a voice signal is provided, however, the apparatus may use, for example, an internal free-utterance engine to obtain a text-based recognition result for the spoken sentence. Of course, if the user device 100 itself recognizes the spoken sentence and provides a text-based result, this step may be omitted. Accordingly, embodiments of the present invention do not particularly limit the form in which the natural language processing apparatus 120 receives the sentence provided by the user, or the form in which it provides the two divided short sentences to the user device 100.

More specifically, the natural language processing apparatus 120 according to an embodiment of the present invention may store in advance entity name corpus information for determining the category, that is, the attribute, of an entity name in an input sentence. For example, 'Harry Potter' in an input sentence may be a movie, but it may also refer to a person. The natural language processing apparatus 120 therefore performs training and learning based on the large amount of stored entity name corpus information, so that it can quickly distinguish the category of an entity name in a given input sentence. In other words, an entity name recognition model is learned from the entity name corpus information, and based on the learned model, entity name categories can be found quickly and automatically for the various short or compound sentences that are input. This process is the entity name recognition process.

In addition, in order to apply a compound sentence tagged with entity names, or more precisely with entity name categories, to the statistical translation technique, the natural language processing apparatus 120 may store in advance various parallel corpus information in which short and compound sentences are mixed. The apparatus may also be trained on this parallel corpus information, so that an input sentence, that is, a sentence substituted into an internal machine language according to an embodiment of the present invention, can be converted quickly into divided short sentences. For example, a statistical translation model is trained on the parallel corpus information, and based on the trained model, an input sentence tagged with entity name categories is divided and restored into a plurality of basic sentences.

Here, restoration is the process of finding the original sentence for an input sentence that has been altered into the various forms mentioned above. For example, a statistical translation model trained on the input corpus information may output a plurality of basic sentences directly through execution of a mathematical algorithm or program. Restoration can also be carried out in various ways. For example, when the entity names in an input compound sentence are replaced with entity name categories, the original entity names are known because they are stored separately, so in each divided sentence the substituted category can be changed back to the entity name it replaced. For Korean, the sentence-final ending can then be adjusted appropriately based on a mathematical algorithm or on information stored in a DB. For instance, if a DB search shows that '한 후에' ('after doing') and '하고' ('do ... and') are mapped to '해줘' ('please do'), the ending is changed accordingly. In embodiments of the present invention, however, this operation is preferably handled by the translation model. Since restoration can thus be performed in various ways, it is not particularly limited to the above.
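The DB-based variant of restoration described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the table `ENDING_TABLE` stands in for the DB and contains just the two mappings given in the text, the function name `restore` is hypothetical, and placeholders are assumed to be written as '@' followed by a category name.

```python
import re

# Stand-in for the DB: connective endings mapped to a sentence-final ending.
# Only the two mappings mentioned in the text are included (assumption).
ENDING_TABLE = {"한 후에": "해줘", "하고": "해줘"}

# A placeholder is assumed to look like "@movie" or "@ movie".
PLACEHOLDER = re.compile(r"@\s*\w+")


def restore(short_sentence: str, entity_names: list[str]) -> str:
    """Put stored entity names back in order, then fix the final ending."""
    names = iter(entity_names)
    # Each placeholder is replaced by the next separately stored entity name.
    restored = PLACEHOLDER.sub(lambda match: next(names), short_sentence)
    for connective, final_ending in ENDING_TABLE.items():
        if restored.endswith(connective):
            restored = restored[: -len(connective)] + final_ending
            break
    return restored


print(restore("@movie 녹화하고", ["Harry Potter"]))
```

The sketch assumes that the list of stored names has at least as many entries as there are placeholders; a sentence with no matching ending is returned with only the entity names restored.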

In summary, in the system training stage, an embodiment of the present invention constructs (or builds in advance), automatically or manually, sentences composed of a plurality of basic sentences, that is, complex or compound sentences, from basic sentences, that is, short sentences, and then learns a statistical translation model from training data in which a list of basic sentences is paired with the corresponding compound sentence. In the system execution stage, the trained statistical translation model is applied to an input sentence so that a plurality of basic sentences can be divided and restored from the compound sentence.

So far, the natural language processing apparatus 120 has been described as applying a statistical translation model to output the corpus information learned for an input sentence, and as dividing and restoring the input sentence on that basis. To improve performance, the natural language processing apparatus 120 determines the entity name categories and runs the statistical translation model based on them. Embodiments of the present invention need not use a statistical translation model, however: separate corpus information of the kind used for such training may instead be stored in advance in a DB and used directly. In other words, the entity name categories are determined in the input sentence, whether short or compound, and the DB is searched based on the determined categories to extract the matching corpus information. The sentence can then be divided and restored into a plurality of short sentences based on the identifier in the extracted corpus information. Accordingly, embodiments of the present invention are not particularly limited to the above translation model.
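The DB-based alternative can be sketched in Python as follows. The dictionary `CORPUS_DB`, its two entries, and the function name `split_by_lookup` are hypothetical stand-ins chosen for illustration; the text does not specify how the DB is keyed, so keying it by the category-tagged sentence is an assumption.

```python
# Hypothetical DB: category-tagged input -> stored form with '#' identifiers.
CORPUS_DB = {
    "record @movie and show me @movie": "record @movie # show me @movie",
    "record @movie": "record @movie",
}


def split_by_lookup(tagged_sentence: str, db: dict[str, str] = CORPUS_DB) -> list[str]:
    """Look up matching corpus information and split it at the identifier."""
    stored = db.get(tagged_sentence)
    if stored is None:
        # No matching corpus information: leave the sentence undivided.
        return [tagged_sentence]
    return [part.strip() for part in stored.split("#")]


print(split_by_lookup("record @movie and show me @movie"))
```

Unlike a trained model, an exact-match lookup of this kind only covers sentence forms that were stored in advance, which is one reason the category substitution matters: many different inputs collapse onto the same key.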

The natural language processing apparatus 120 may also determine whether an input sentence is a short sentence or a compound sentence and then process it differently depending on the result. For example, a conventional natural language processing method may be used for a short sentence, and the technique according to an embodiment of the present invention applied only to a compound sentence. This can reduce the cost of building the system. For efficiency, however, the system is preferably designed so that it can process both short and compound sentences when either is provided as an input sentence.

As a result of the above configuration, an embodiment of the present invention can divide and restore a long and complicated input sentence into a plurality of short sentences, which can improve system performance. In other words, a system that previously could process only short sentences becomes able to process both short and compound sentences, so its performance can increase.

FIG. 2 is a diagram showing the detailed structure of the natural language processing apparatus of FIG. 1, illustrating a configuration divided into hardware components.

For convenience of description, referring to FIG. 2 together with FIG. 1, the natural language processing apparatus 120 according to an embodiment of the present invention includes some or all of a communication interface unit 200, a natural language processing unit 210, and a storage unit 220.

Here, "including some or all" means that some components, such as the communication interface unit 200, may be omitted, or that some components, such as the storage unit 220, may be integrated into other components, such as the natural language processing unit 210. To aid a full understanding of the invention, the description below assumes that all of them are included.

The communication interface unit 200 receives natural language in the form of a short or compound sentence provided by the user device 100. The natural language may be a sentence provided through an operation of the user device 100 such as a search, question answering, or chat, and a search or question answering may be carried out through voice recognition. In that case, a text-based recognition result is preferably provided as the input sentence; if only a voice signal is input, it may be passed to the natural language processing unit 210, which generates the text-based recognition result.

The communication interface unit 200 also provides the input natural language to the natural language processing unit 210 and, under the control of the natural language processing unit 210, may transmit the processed result to the user device 100. The processed result may be provided in the same language, as a plurality of short sentences divided from the input complex sentence, that is, the compound sentence. Alternatively, it may be a plurality of short sentences converted into another language. Furthermore, the short sentences converted into another language may be provided joined by a conjunction, namely the conjunction of the originally input compound sentence. The conjunction is of course rendered in the target language as well. For example, the Korean '~고' or '그리고' may be changed to ',' or 'and' in English. On this basis, the user device 100 can perform an operation according to the user's voice utterance, answer a question, or carry out a chat operation.

The natural language processing unit 210 performs an entity name recognition operation on the various forms of compound sentences that are input. For example, as fully described above, it determines the category of each entity name in the input sentence based on the entity name corpus information stored in the storage unit 220 and the recognition model trained on it. In this way, it learns that, for instance, 'Harry Potter' or 'Gone with the Wind' in the input sentence belongs to the movie category.

The natural language processing unit 210 may also use the category information of each entity name to substitute the input compound sentence into a machine language suitable for the statistical translation model. In other words, when the compound sentence 'Record Harry Potter and show Gone with the Wind' is input, the unit knows that 'Harry Potter' and 'Gone with the Wind' belong to the movie category, so it substitutes, that is, changes, the sentence into the form '@ movie record, @ movie show'. Such data may in practice be processed as bit information using '0' and '1'. This is a matter that a system designer can change freely, so it is not particularly limited. Processing the data as bit information may be all the more useful when a DB is used instead of a mathematical translation model.
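The substitution step can be sketched in Python as follows. This is a minimal illustration that assumes the entity names and their categories have already been recognized and are passed in as a dictionary; the function name `substitute_categories` and the '@category' placeholder spelling are assumptions, and English word order is used for readability.

```python
import re


def substitute_categories(sentence: str, categories: dict[str, str]) -> tuple[str, list[str]]:
    """Replace each recognized entity name with an '@category' placeholder.

    Returns the substituted sentence and the original names in order of
    appearance, so that they can be restored after the sentence is divided.
    """
    if not categories:
        return sentence, []
    # Try longer names first so that a short name cannot split a longer one.
    names = sorted(categories, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    originals: list[str] = []

    def to_placeholder(match: re.Match) -> str:
        originals.append(match.group(0))
        return "@" + categories[match.group(0)]

    return pattern.sub(to_placeholder, sentence), originals


tagged, stored = substitute_categories(
    "record Harry Potter and show me Gone with the Wind",
    {"Harry Potter": "movie", "Gone with the Wind": "movie"},
)
print(tagged)
print(stored)
```

Any other pair of movie titles produces the same substituted sentence, which is what allows one trained pattern to cover many concrete inputs.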

Of course, even when a different compound sentence such as 'Record Harry Potter and show Ghost' (Korean title '사랑과 영혼') is input, the natural language processing unit 210 substitutes it into the same data as in the case above. That is, the purpose of the entity name recognition process in embodiments of the present invention is not to identify the entity names precisely, but to divide the input compound sentence quickly into a plurality of short sentences by means of generalized category information and the statistical translation technique.

In this way, the natural language processing unit 210 applies the statistical translation model trained on the basic and compound sentence parallel corpus information to extract compound sentence corpus information for the category-tagged input sentence, and generates a plurality of short sentences on that basis. For example, the natural language processing unit 210 applies the translation model to '@ movie record, @ movie show' and outputs '@ movie record # @ movie show'. The unit then determines the sentence boundary from this output and generates two independent short sentences according to the result, namely 'Record Harry Potter' and 'Show Gone with the Wind'.

Thus the natural language processing unit 210 determines the category information of the entity names in the input sentence, substitutes it with a simple machine language, and applies the substituted information, that is, the machine language, to the statistical translation model, so that the input sentence can be processed quickly.

So far, the natural language processing unit 210 has been described as outputting results through mathematical algorithms such as an entity name recognition model or a statistical translation model. Because these models are trained on a wide variety of information, they can output results quickly for various forms of input sentences. As mentioned above, however, an embodiment of the present invention may instead search a DB for the entity names to extract information on their categories, and then use the extracted categories to search a DB in which corpus information is stored and extract the related corpus information. The sentence can then be divided and restored based on the identifier information of the extracted corpus information, so embodiments of the present invention are not particularly limited to the above.

Although not shown in a separate drawing, the natural language processing unit 210 includes a control unit and a natural language processing execution unit that are separated in hardware, and the control unit may in turn include a physically separate CPU and memory. The natural language processing execution unit may include a program for executing natural language processing according to an embodiment of the present invention. Accordingly, at initial operation of the system, the CPU may load the program of the natural language processing execution unit into the memory and perform the natural language processing operation. Alternatively, the CPU may simply run the natural language processing execution unit and receive only the processing result, so embodiments of the present invention are not particularly limited to the above.

The storage unit 220 includes hardware such as a memory or DB and software storage such as a registry. The storage unit 220 may store the entity name corpus information and the basic and compound sentence parallel corpus information mentioned above. This information is normally output under the control of the natural language processing unit 210, but the storage unit 220 may instead provide all of it at initial operation of the system, at the request of the natural language processing unit 210, so that it is held in storage inside the natural language processing unit 210. Embodiments of the present invention therefore do not particularly limit how the information is handled. In this respect, the storage unit 220 may be omitted from the apparatus configuration and integrated into the natural language processing unit 210.

The communication interface unit 200, the natural language processing unit 210, and the storage unit 220 are configured as hardware modules physically separate from one another, but each module may store and execute software for performing the above operations. Since that software is a set of software modules, and each module can equally well be formed in hardware, the configuration is not particularly limited to either software or hardware. For example, the storage unit 220 may be hardware such as storage or memory, but information may also be stored in software as a repository, so it is not particularly limited to the above.

Further details are described below with reference to FIGS. 4 and 5.

FIG. 3 is a diagram showing another detailed structure of the natural language processing apparatus of FIG. 1, illustrating a configuration implemented in software, and FIG. 4 shows an example of the basic and compound sentence parallel corpus information of FIG. 3.

For convenience of description, referring to FIG. 3 together with FIG. 1, a natural language processing apparatus 120' according to another embodiment of the present invention may include a natural language processing module 300 and a storage module 310. These may correspond to the natural language processing unit 210 and the storage unit 220 of FIG. 2.

Comparing their functions with FIG. 2, the natural language processing module 300 of FIG. 3 corresponds to the communication interface unit 200 and the natural language processing unit 210 of FIG. 2, while the storage module 310 corresponds to the storage unit 220 of FIG. 2.

According to an embodiment of the present invention, the natural language processing module 300 may include only an entity name recognition execution unit 300-1 and a statistical translation execution unit (or translation execution unit) 300-2, but may further include some or all of an entity name recognition model 300-3, an entity name recognition training unit 300-5, a statistical translation model 300-7, and a statistical translation training unit 300-9. Here, "including some or all" has the same meaning as above.

Based on the entity name recognition model 300-3, the entity name recognition execution unit 300-1 automatically finds the entity names in an input sentence, or more precisely their categories. For example, in the sentence "Do you know who starred in Harry Potter?", it automatically finds that "Harry Potter" is a movie.

To this end, the entity name recognition training unit 300-5 learns (or trains) the entity name recognition model 300-3 based on the entity name corpus, that is, the corpus information 310-1, contained in the storage module 310, so that entity name recognition can be performed.

This entity name recognition training requires an entity name corpus, that is, the corpus information 310-1, created by a system designer, a user, or the like. The entity name corpus 310-1 may consist of a list of sentences in which the parts corresponding to entity names are marked with entity name tags. For example, the sentence "Who starred in <movie>Harry Potter</movie>?" in the entity name corpus means that "Harry Potter" belongs to the entity name category movie.
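A tagged corpus sentence of this kind can be read into (entity name, category) pairs with a short Python sketch such as the one below. The regular expression assumes simple, non-nested tags whose names match the category, as in the example; the function name `parse_tagged_sentence` is hypothetical.

```python
import re

# Matches "<category>entity name</category>" with a matching closing tag.
TAG = re.compile(r"<(\w+)>\s*(.*?)\s*</\1>")


def parse_tagged_sentence(tagged: str) -> tuple[str, list[tuple[str, str]]]:
    """Return the plain sentence and its (entity name, category) pairs."""
    entities = [(match.group(2), match.group(1)) for match in TAG.finditer(tagged)]
    # Drop the tags but keep the entity name text in place.
    plain = TAG.sub(lambda match: match.group(2), tagged)
    return plain, entities


print(parse_tagged_sentence("Who starred in <movie>Harry Potter</movie>?"))
```

Pairs extracted this way are the kind of labeled examples from which an entity name recognition model could be trained; the training procedure itself is outside the scope of this sketch.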

The statistical (machine) translation execution unit 300-2 divides and restores a sentence tagged with entity name categories into a plurality of basic sentences based on the statistical translation model 300-7, and outputs them. For example, it can divide and restore the compound sentence "record @movie and show me @movie" into the basic sentences "record @movie" and "show me @movie". In this process, it may also convert the sentences back to their original form, that is, output "record Harry Potter" and "show me Gone with the Wind".

Even when the basic sentences themselves are altered in forming a single compound sentence, as in the Korean "@movie 녹화하고 @movie 틀어줘" ("record @movie and play @movie"), the statistical translation execution unit 300-2 can divide and restore the sentence into the basic sentences "@movie 녹화해줘" ("record @movie") and "@movie 틀어줘" ("play @movie").

Furthermore, when a basic sentence, that is, a short sentence, is input, the statistical translation execution unit 300-2 can leave the single basic sentence as it is. This result, too, may be provided on the basis of the statistical translation model 300-7 trained on the basic and compound sentence parallel corpus information 311.

For this purpose, the statistical translation execution unit 300-2 replaces each entity name value in the input sentence, corresponding to an entity name portion in the sentences of the basic and compound sentence parallel corpus information 311, with its entity name category. For example, if "OCN news" in the sentence "Record OCN news" belongs to the movie entity name category, the sentence is replaced with "Record @movie".

The purpose of this entity name substitution is to group the words that make up an entity name into a single unit and thereby improve the performance of statistical translation for sentence division and restoration, as fully described above.

For this execution, the natural language processing module 300 trains the statistical translation model 300-7 through the statistical translation training unit 300-9. This is why the basic and compound sentence parallel corpus information 311 is needed.

The basic and compound sentence parallel corpus information 311 is a list of pairs, each pairing a plurality of basic sentences with the single compound sentence that corresponds to them, as shown in FIG. 4. In FIGS. 4(a) to 4(c), the input corresponds to one compound sentence in the list, in substituted form. The output corresponds to the plurality of basic sentences matched to that substituted form. Item 1 shows a compound sentence that is substituted and input without a conjunction, item 2 is an example of a compound sentence containing a conjunction (and), and item 3 shows the case in which a short sentence is input. As mentioned above, this corpus information 311 may also be stored in a DB in the form of 2-bit information.

As seen in FIG. 4, the plurality of basic sentences in the basic and compound sentence parallel corpus information 311 are joined by an identifier, for example a specific sentence boundary separator such as "#". For example, "Record OCN news # show me Family Guy" joins the two sentences "Record OCN news" and "Show me Family Guy".

The basic and compound sentence parallel corpus information 311 is generated automatically or manually (313, 315) from basic sentence corpus information 317. The basic sentence corpus information 317 consists of a list of basic sentences; for example, "Record OCN news" is one basic sentence. The basic sentence corpus information 317 may be built separately from the entity name corpus information 310-1, and the same storage module may be used for both.

The automatic compound sentence generation module 313 automatically generates compound sentences by joining two or more basic sentences from the basic sentence corpus information 317. For example, it joins the basic sentence "Record OCN news" and the basic sentence "Show me Family Guy" to automatically generate the compound sentence "Record OCN news and show me Family Guy".
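Automatic generation of this kind, together with the matching "#"-joined form, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the module itself: the function name `make_parallel_pairs` is hypothetical, only two-sentence compounds joined by "and" are produced, basic sentences are assumed to be non-empty, and lowercasing the first letter of the second sentence simply mirrors the English examples in the text.

```python
from itertools import permutations


def make_parallel_pairs(
    basic_sentences: list[str], conjunction: str = "and", separator: str = "#"
) -> list[tuple[str, str]]:
    """Pair each generated compound sentence with its separator-joined form."""
    pairs = []
    for first, second in permutations(basic_sentences, 2):
        # Lowercase the initial letter of the second sentence, as in the examples.
        lowered = second[0].lower() + second[1:]
        compound = f"{first} {conjunction} {lowered}"
        target = f"{first} {separator} {lowered}"
        pairs.append((compound, target))
    return pairs


for compound, target in make_parallel_pairs(["Record OCN news", "Show me Family Guy"]):
    print(compound, "->", target)
```

In a fuller pipeline the entity names would be replaced by category placeholders before the pairs are used for training; that step is omitted here to keep the pairing logic visible.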

However, a compound sentence that is difficult for the automatic compound sentence generation module 313 to generate may be created manually by a system designer or user through the manual compound sentence generation module 315. In the manual generation step, the system designer or user manually creates a compound sentence based on two or more basic sentences from the basic sentence corpus information 317. Furthermore, during manual generation the system designer may, for example, come up with basic sentences directly and manually create a compound sentence from them. Because the basic and compound sentence parallel corpus information 311 can be built in these various ways, the embodiments of the present invention are not particularly limited as to how the information is built.

The statistical translation training unit 300-9 trains the statistical translation model 300-7 based on the basic and compound sentence parallel corpus information 311.

The statistical translation execution unit 300-2 then uses the trained statistical translation model to restore an input short sentence, or to split an input compound sentence and restore its basic sentences, and outputs the result.

With the above configuration, the natural language processing module 300 performs entity name recognition on an input sentence such as a compound sentence, performs statistical translation based on the recognition result to split the compound sentence, and generates the restored plurality of basic sentences, which may improve the performance of the natural language processing system 110'.

The concepts of training and execution in FIG. 3 are described in more detail below.

Execution, as referred to in FIG. 3, means that a machine, i.e., an apparatus, receives some input, works out what a person wants to know, and outputs it. For example, in emotion recognition, which infers emotion from a human face, the input is a face and the output is the person's emotion (e.g., joy, sadness, ...). When a statistics-based approach is taken, execution consists of an algorithm and a model. The algorithm is contained in the execution unit, and the model describes the relationship between input and output. For example, what expression a person makes when happy and what expression when sad is a kind of model, and it is expressed mathematically. The algorithm describes the process of inferring emotion from a face on the basis of such a model.

Execution was described above as a machine receiving an input and outputting what it works out. Training, by contrast, is a machine receiving both inputs and their corresponding outputs and building a model of the relationship between them. Training requires a person to prepare training data consisting of input-output pairs. For example, if there are 1,000 face images and the emotion expressed by the face in each image is recorded, the result is training data with the face image as input and the emotion as output. In natural language processing, such data is called a corpus.

From this training and execution perspective, entity name recognition and (machine) translation are examined below.

Entity name recognition means determining, in a sentence such as "바람과 함께 사라지다 시작시간 알려줘" ("Tell me the start time of Gone with the Wind"), that "바람과 함께 사라지다" ("Gone with the Wind") is an entity name of the movie category (or type). That is, the input is a natural language sentence and the output is an entity name category. The entity name categories handled by a Samsung Electronics TV may include movie title (movie), movie genre (genre), cast member (actor), time (time), and so on.

Accordingly, the entity name corpus information 310-1 of FIG. 3, for example, may consist of sentences annotated with entity name types, such as the following.

/죽은 시인의 사회/movie/ 시작시간 알려줘. (Tell me the start time of /Dead Poets Society/movie/.)

/글래디에이터/movie/ 시작시간 알려줘. (Tell me the start time of /Gladiator/movie/.)

/오늘/time/ /유*석/actor/ 나온 프로(그램) 뭐 있지? (What programs featuring /유*석/actor/ are on /today/time/?)

The entity name recognition training unit 300-5 then builds the entity name recognition model 300-3, which captures the relationship between sentences and entity names, from an entity name corpus like the one above. In the training data above, a phrase such as "시작시간 알려줘" ("tell me the start time") appears to the right of an entity name denoting a movie title (movie). Such knowledge can be expressed, for example, mathematically.

Given an input sentence, the entity name recognition execution unit 300-1 finds the entity names, i.e., their entity name categories, based on the entity name recognition model 300-3.
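As an illustration of the annotated corpus format shown above, the following sketch reads one corpus line and recovers the plain sentence together with its (entity name, category) pairs. The `/name/category/` pattern is inferred from the three example lines only, and the function name `parse_annotated` is hypothetical; a real recognition model would be trained on such pairs rather than reading them with a regular expression.

```python
import re

# "/<entity name>/<category>/" as seen in the example corpus lines (assumed format).
ANNOTATION = re.compile(r"/([^/]+)/([a-z]+)/")


def parse_annotated(line):
    """Return (plain sentence, [(entity name, category), ...]) for one corpus line."""
    entities = [(m.group(1), m.group(2)) for m in ANNOTATION.finditer(line)]
    # Strip the slashes and category label, keeping only the entity name.
    plain = ANNOTATION.sub(lambda m: m.group(1), line)
    return plain, entities


for line in [
    "/죽은 시인의 사회/movie/ 시작시간 알려줘.",
    "/글래디에이터/movie/ 시작시간 알려줘.",
    "/오늘/time/ /유*석/actor/ 나온 프로(그램) 뭐 있지?",
]:
    print(parse_annotated(line))
```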

Turning to (machine) translation for sentence splitting and restoration: translation ordinarily means converting a sentence in one language, such as "Train is fast", into a sentence in another language, such as "기차는 빠르다". In the embodiments of the present invention, however, the goal is to split a compound sentence such as "Record OCN news and show me Family Guy" into several basic sentences such as "Record OCN news" and "show me Family Guy". To express this process systematically, the several basic sentences are written as a single sentence whose boundaries are marked with the "#" symbol, as in "Record OCN news # show me Family Guy".

That is, the input is a compound sentence and the output can be regarded as a plurality of basic sentences. The entity name recognition technique described above is applied to the sentence to be translated, i.e., converted. For example, in the sentence "Record OCN news and show me Family Guy", "OCN news" and "Family Guy" are determined to be program names (movie), and the sentence is replaced with "Record @movie and show me @movie". Entity name recognition is applied for two reasons: first, to prevent a sentence boundary from falling inside an entity name (e.g., between "Family" and "Guy"); and second, to improve machine translation accuracy by generalizing words that differ on the surface but are semantically equivalent, such as movie-title (movie) entity names.
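The substitution step can be sketched as follows. The recognizer output is supplied by hand here as a list of (entity name, category) pairs in order of appearance, and `substitute_entities` is a hypothetical helper name; the sketch only shows how an entity name collapses into a single `@category` token so that no boundary can fall inside it.

```python
def substitute_entities(sentence, entities):
    """Replace each recognised entity name with '@<category>'.

    `entities` lists (name, category) pairs in order of appearance.
    Returns the substituted sentence and the original names, so that
    the names can be put back after the sentence has been split.
    """
    pieces, originals, cursor = [], [], 0
    for name, category in entities:
        start = sentence.find(name, cursor)
        if start < 0:
            continue  # name not found; ignored in this sketch
        pieces.append(sentence[cursor:start])
        pieces.append(f"@{category}")
        originals.append(name)
        cursor = start + len(name)
    pieces.append(sentence[cursor:])
    return "".join(pieces), originals


substituted, originals = substitute_entities(
    "Record OCN news and show me Family Guy",
    [("OCN news", "movie"), ("Family Guy", "movie")],
)
print(substituted)  # Record @movie and show me @movie
print(originals)    # ['OCN news', 'Family Guy']
```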

The basic and compound sentence parallel corpus information 311 consists of translation inputs and outputs and is generated automatically and manually. The parallel corpus can be summarized as in <Table 1>, which includes the content of FIG. 4.

<Table 1>

| Input | Output |
|---|---|
| record @movie show me @movie | record @movie # show me @movie |
| record @movie and show me @movie | record @movie # show me @movie |
| record @movie | record @movie |
| who starred in @movie ? | who starred in @movie ? |

The statistical translation training unit 300-9 builds the statistical translation model 300-7, which captures the relationship between input and output sentences, from a corpus like the one above. The training data shows that in "record @movie and show me @movie" the word "and" is replaced (or translated) by "#" while every other word is replaced by itself, and this knowledge is expressed mathematically. It may also be represented as 2-bit information, stored in a DB in that form, and used from there.
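One simple way to express such knowledge numerically is a table of relative frequencies estimated from the parallel corpus. The sketch below is a deliberately reduced stand-in, not the model of the embodiment: it assumes word-by-word positional alignment and therefore skips pairs whose input and output lengths differ (such as the conjunction-free row), which a real statistical translation model would handle with a proper alignment model. The name `train_translation_table` is hypothetical.

```python
from collections import Counter, defaultdict

corpus = [
    ("record @movie show me @movie", "record @movie # show me @movie"),
    ("record @movie and show me @movie", "record @movie # show me @movie"),
    ("record @movie", "record @movie"),
    ("who starred in @movie ?", "who starred in @movie ?"),
]


def train_translation_table(pairs):
    """Estimate P(target word | source word) from position-aligned pairs."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        src, tgt = source.split(), target.split()
        if len(src) != len(tgt):
            continue  # insertion of '#' needs a real alignment model
        for s, t in zip(src, tgt):
            counts[s][t] += 1
    return {
        s: {t: n / sum(c.values()) for t, n in c.items()}
        for s, c in counts.items()
    }


table = train_translation_table(corpus)
print(table["and"])     # {'#': 1.0}
print(table["record"])  # {'record': 1.0}
```

In this toy table "and" maps to "#" with probability 1.0 and every other word maps to itself, which mirrors the observation made about the training data.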

Given an input sentence, the statistical translation execution unit 300-2 performs the translation process based on the statistical translation model 300-7, thereby splitting the compound sentence into several basic sentences and restoring them. The input sentence here is one in which the entity name recognition execution unit 300-1 has found the entity names, i.e., their categories, and the entity names have been substituted.

FIG. 5 is a diagram showing an example of how an input sentence is transformed as it is processed by the natural language processing unit of FIG. 2 or the natural language processing module of FIG. 3.

For convenience, referring to FIG. 5 together with FIG. 3, the entity name recognition execution unit 300-1 of FIG. 3 may receive the sentence "Harry Potter 주연이 누군지 알려주고, KBS 틀어줘" ("Tell me who stars in Harry Potter, and turn on KBS"), as in FIG. 5(a).

The entity name recognition execution unit 300-1 then uses the entity name recognition model 300-3 to automatically find the entity name category of each entity name in the input sentence, as in FIG. 5(b). This yields the information that "Harry Potter" is a movie and "KBS" is a channel name. In the embodiments of the present invention, an input sentence supplied together with such entity name category information may be called an input sentence tagged with entity names, or tagged with entity name categories.

The statistical translation execution unit 300-2 of FIG. 3 then replaces each entity name in the input sentence with its entity name type. As a result, the input sentence is changed to "@movie 주연 누군지 알려줘, @channel name 틀어줘", as in FIG. 5(c).

The statistical translation execution unit 300-2 can then use the statistical translation model 300-7 to generate (or extract) the sentence "@movie 주연 누군지 알려줘 # @channel name 틀어줘", as in FIG. 5(d).

On this basis, the statistical translation execution unit 300-2 determines the boundaries of the basic sentences from the sentence delimiter (#), the identifier that marks sentence boundaries, and generates and outputs the original sentence for each basic sentence, namely "Harry Potter 주연이 누군지 알려줘" ("Tell me who stars in Harry Potter") and "KBS 틀어줘" ("Turn on KBS").

In effect, the statistical translation execution unit 300-2 splits the input sentence and at the same time restores the original sentences, i.e., the basic sentences.
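The final split-and-restore step can be sketched as below. The translated string is taken from the FIG. 5(d) example and is passed in directly rather than produced by a model; the helper name `split_and_restore` and the convention of passing (placeholder, entity name) pairs in order of appearance are assumptions made for illustration.

```python
def split_and_restore(translated, entities, boundary="#"):
    """Split a delimited sentence and put the entity names back.

    `entities` lists (placeholder, entity name) pairs in the order in
    which the placeholders appear in `translated`.
    """
    remaining = list(entities)
    sentences = []
    for segment in translated.split(boundary):
        segment = segment.strip()
        if not segment:
            continue
        pieces, cursor = [], 0
        # Consume placeholders left to right while they occur in this segment.
        while remaining:
            placeholder, name = remaining[0]
            start = segment.find(placeholder, cursor)
            if start < 0:
                break  # next placeholder belongs to a later segment
            pieces += [segment[cursor:start], name]
            cursor = start + len(placeholder)
            remaining.pop(0)
        pieces.append(segment[cursor:])
        sentences.append("".join(pieces))
    return sentences


result = split_and_restore(
    "@movie 주연 누군지 알려줘 # @channel name 틀어줘",
    [("@movie", "Harry Potter"), ("@channel name", "KBS")],
)
print(result)  # ['Harry Potter 주연 누군지 알려줘', 'KBS 틀어줘']
```

Splitting on the boundary before restoring the names keeps an entity name from ever being cut, since at that point each name is still a single placeholder token.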

FIG. 6 is a diagram showing a natural language processing procedure according to an embodiment of the present invention.

Referring to FIG. 6, the natural language processing apparatus 120 according to an embodiment of the present invention stores corpus information that includes entity name categories for words in sentences and an identifier marking sentence boundaries (S600). The corpus information here may be understood as the basic and compound sentence parallel corpus information shown in FIG. 3. Step S600 may of course include a state in which the statistical translation model has been trained on the stored corpus information.

Next, the natural language processing apparatus 120 receives from the user device 100 a compound sentence that the user has input in natural language (S610). Here, a compound sentence means a sentence formed by joining a plurality of short sentences, with or without a conjunction. Joining without a conjunction refers to the case where short sentences are provided in succession at fixed time intervals.

The natural language processing apparatus 120 determines the entity name categories of the words, i.e., the entity names, contained in the two basic sentences of the received compound sentence (S620). This can be determined by the entity name recognition model trained on input sentences. For example, even if "Harry Potter" appears in different compound sentences, it may be a movie title in one and refer to a person in another. Even for such an identical entity name, the category can be determined accurately by an entity name recognition model trained on a variety of sentences. Likewise, if different compound sentences contain "Harry Potter" and "바람과 함께 사라지다" ("Gone with the Wind") respectively, both can be determined to belong to the same entity name category, namely the movie category.

Once the entity name categories have been determined, the natural language processing apparatus 120 obtains corpus information related to the determined categories and generates a plurality of short sentences on that basis (S630). In this process, once the entity name attributes in the input compound sentence are determined, the apparatus replaces, for example, each word corresponding to an entity name with its entity name category. It then applies the compound sentence containing the substituted categories to the statistical translation model to obtain corpus information. The output corpus information has the same form as the category-substituted compound sentence, except that it contains an identifier such as a sentence delimiter between the two sentences.

Accordingly, the natural language processing apparatus 120 locates the boundary between the two sentences in the input compound sentence from the sentence-separating identifier, splits the compound sentence into a plurality of short sentences on that basis, and restores the original sentences. This has been described in full with reference to FIGS. 4 and 5, so further description is omitted.

The natural language processing apparatus 120 may then provide the generated short sentences to the user device 100 (S640). The short sentences may be provided in various forms. For example, when the user device 100 processes voice commands, as a DTV does, they may be provided as two short sentences such as 'Harry Potter 녹화해줘' ('Record Harry Potter') and 'KBS 틀어줘' ('Turn on KBS'). For a video display device such as a PC used for small talk or chatting, the natural language processing apparatus 120 may rejoin the sentences with the original conjunction before providing them, or may translate them into another language. In the latter case, the two short sentences may each be translated and then joined with a conjunction. In this way, the original sentences restored after splitting may be converted into various forms and provided to the user device 100.

FIG. 7 is a flowchart showing a natural language processing method according to a first embodiment of the present invention.

For convenience, referring to FIG. 7 together with FIG. 6, the natural language processing apparatus 120 according to this embodiment stores corpus information that includes entity name categories for words in sentences, more precisely for entity name words, and an identifier marking sentence boundaries (S700). Storing here includes a state in which the statistical translation model has been trained on the stored corpus information.

Next, the natural language processing apparatus 120 receives from the user device 100 a compound sentence that the user has input in natural language (S710). Here, a compound sentence means a sentence formed by joining a plurality of short sentences, with or without a conjunction. Joining without a conjunction refers to the case where short sentences are provided in succession at fixed time intervals.

The natural language processing apparatus 120 then generates a plurality of short sentences based on the identifier of the corpus information related to the entity name categories determined in the received compound sentence (S720), and may then provide the generated short sentences to the user device 100.

Other details of FIG. 7 have been described in full with reference to FIG. 6, so further description is omitted.

FIG. 8 is a flowchart showing a natural language processing method according to a second embodiment of the present invention.

For convenience, referring to FIG. 8 together with FIG. 6, the natural language processing apparatus 120 according to the second embodiment of the present invention builds a translation model trained on corpus information that includes entity name categories for entity name words in sentences and an identifier marking sentence boundaries (S800).

Next, the natural language processing apparatus 120 receives from the user device 100 a compound sentence that the user has input in natural language (S810). Here, a compound sentence is a sentence formed by joining a plurality of short sentences, for example a first short sentence and a second short sentence, with or without a conjunction. Joining without a conjunction refers to the case where short sentences are provided in succession at fixed time intervals.

The natural language processing apparatus 120 also changes the entity name words in the received compound sentence into entity name categories (S820). For example, entity name words in the input that differ from one another but share a category are replaced with the same category value. In other words, although the entity name words '바람과 함께 사라지다' ('Gone with the Wind') and 'Harry Potter' differ, they belong to the same movie category, so each is replaced with a category value such as '@movie'. Because the embodiments aim to predict the sentence boundary of an input compound sentence quickly and split it into two sentences, processing can be regarded as correspondingly faster.

Next, the natural language processing apparatus 120 applies the category-substituted compound sentence to the translation model to output (or generate) corpus information related to the entity name categories (S830). Having been trained on a variety of corpus information, the translation model outputs the corpus information related to the entity name categories determined in the input compound sentence.

The natural language processing apparatus 120 then splits the input compound sentence based on the identifier of the generated corpus information, i.e., the sentence delimiter, and restores the original sentences. For example, for "Harry Potter 녹화해 주고, KBS 틀어줘" ("Record Harry Potter, and turn on KBS"), it may split the input into the two sentences "Harry Potter 녹화해 주고," and "KBS 틀어줘" based on the corpus information, and then restore them to the original sentences in short-sentence form, "Harry Potter 녹화해줘" ("Record Harry Potter") and "KBS 틀어줘" ("Turn on KBS").

As described above, the embodiments of the present invention describe, beyond applying mathematical models such as the entity name recognition model and the statistical translation model, a method of retrieving corpus information, for example by searching a DB based only on entity name category information, and generating a plurality of short sentences, i.e., two short sentences, on that basis. Beyond this, however, the embodiments may include any other method for splitting a compound sentence into short sentences, provided that it identifies the entity name categories in the input sentence, uses them to identify the boundary between the two sentences quickly, and splits the sentence on that basis.
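The DB-search variant mentioned above can be pictured as a direct lookup keyed by the category-substituted sentence. The sketch below uses an in-memory dictionary in place of a DB and fills it with the rows of <Table 1>; the names `CORPUS_DB` and `lookup_boundaries`, the whitespace and case normalization, and the exact-match policy are all assumptions made for illustration.

```python
# Substituted input -> delimited output, taken from the rows of <Table 1>.
CORPUS_DB = {
    "record @movie show me @movie": "record @movie # show me @movie",
    "record @movie and show me @movie": "record @movie # show me @movie",
    "record @movie": "record @movie",
    "who starred in @movie ?": "who starred in @movie ?",
}


def lookup_boundaries(substituted, db=CORPUS_DB, boundary="#"):
    """Return the short sentences stored for a substituted input, or None."""
    key = " ".join(substituted.lower().split())  # normalise case and spacing
    output = db.get(key)
    if output is None:
        return None  # no stored entry; a trained model would be needed
    return [part.strip() for part in output.split(boundary)]


print(lookup_boundaries("Record @movie and show me @movie"))
print(lookup_boundaries("play @movie"))
```

An exact-match lookup only covers inputs already present in the corpus, which is the practical reason a trained translation model generalizes better.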

Furthermore, although the description so far has covered processing received short, compound, and complex sentences, and combinations of them (e.g., short + compound, short + complex, compound + complex), into a plurality of short sentences, the split and restored short sentences may equally be recombined and provided to the user device 100 of FIG. 1 as a compound or complex sentence. The embodiments of the present invention are therefore not particularly limited to short-sentence output.

Meanwhile, the fact that all components of the embodiments are described as being combined into one or operating in combination does not mean that the present invention is necessarily limited to such embodiments. That is, within the scope of the object of the present invention, one or more of the components may be selectively combined and operated. In addition, although each component may be implemented as independent hardware, some or all of the components may be selectively combined and implemented as a computer program having program modules that perform some or all of the combined functions on one or more pieces of hardware. The code and code segments constituting such a computer program can be readily inferred by those skilled in the art. The computer program may be stored on non-transitory computer readable media and read and executed by a computer, thereby implementing the embodiments of the present invention.

Here, a non-transitory readable recording medium means not a medium that stores data for a short moment, such as a register, cache, or memory, but a medium that stores data semi-permanently and can be read by a device. Specifically, the programs described above may be stored and provided on a non-transitory readable recording medium such as a CD, DVD, hard disk, Blu-ray disc, USB, memory card, or ROM.

Although preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described. Various modifications may be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed, and such modifications should not be understood separately from the technical idea or outlook of the present invention.

100: user device 110: communication network
120: natural language processing apparatus 200: communication interface unit
210, 300: natural language processing unit (module) 220, 310: storage unit (module)
300-1: entity name recognition execution unit 300-2: statistical translation execution unit
300-3: entity name recognition model 300-5: entity name recognition training unit
300-7: statistical translation model 300-9: statistical translation training unit

Claims

A natural language processing system comprising:
a user device into which a compound or complex sentence is input; and
a natural language processing apparatus which, when the compound or complex sentence is input, generates a plurality of short sentences corresponding to the compound or complex sentence using an identifier of pre-stored corpus information and provides the generated short sentences to the user device,
wherein the corpus information includes an entity name category to which an entity name word in a sentence belongs, and the identifier, which distinguishes boundaries between sentences.

A natural language processing apparatus comprising:
a storage unit that stores corpus information including an entity name category to which an entity name word in a sentence belongs and an identifier for distinguishing boundaries between sentences;
a communication interface unit that receives a compound or complex sentence input to a user device; and
a natural language processing unit which, when the compound or complex sentence is input, generates a plurality of short sentences corresponding to the compound or complex sentence using the identifier of the stored corpus information and provides the generated short sentences to the user device.

The natural language processing apparatus of claim 2,
Wherein the natural language processing unit determines the category of the entity name word in the received compound or complex sentence, extracts from the storage unit the corpus information including the determined entity name category, and generates the plurality of short sentences based on the extracted corpus information.

The natural language processing apparatus of claim 2,
Wherein the natural language processing unit comprises:
An entity name recognition execution unit for determining the category of the entity name word in the compound or complex sentence; and
A statistical translation execution unit for changing the entity name word in the compound or complex sentence into the determined entity name category, acquiring the corpus information related to the changed entity name category, and generating the plurality of short sentences based on the acquired corpus information.

The natural language processing apparatus of claim 4,
Wherein the statistical translation execution unit divides the compound or complex sentence based on the identifier and restores the divided compound or complex sentence into the plurality of short sentences, thereby generating the plurality of short sentences.
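
The category substitution, identifier-based division, and restoration recited above might be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the gazetteer `ENTITY_CATEGORIES`, the table `CORPUS`, the boundary symbol `<B>`, the example sentences, and all function names are hypothetical, and plain dictionary lookups stand in for the trained entity name recognition model and statistical translation model.

```python
# Hypothetical gazetteer standing in for a trained entity name recognition model.
ENTITY_CATEGORIES = {
    "Harry Potter": "@movie",
    "Titanic": "@movie",
    "KBS": "@channel",
    "MBC": "@channel",
}

BOUNDARY = "<B>"  # assumed symbol acting as the sentence-boundary identifier

# Hypothetical corpus information: a category-level pattern mapped to the
# same pattern annotated with the boundary identifier.
CORPUS = {
    "record @movie and turn on @channel": "record @movie <B> turn on @channel",
}


def recognize_entities(sentence):
    """Replace known entity name words with their categories.

    Returns the category-level pattern and the (category, word) pairs
    in the order the words appear in the sentence.
    """
    found = []
    for word, category in ENTITY_CATEGORIES.items():
        idx = sentence.find(word)
        if idx >= 0:
            found.append((idx, word, category))
    found.sort()
    pattern = sentence
    for _, word, category in found:
        pattern = pattern.replace(word, category, 1)
    return pattern, [(category, word) for _, word, category in found]


def split_sentence(sentence):
    """Divide a compound sentence at the boundary identifier and restore entity words."""
    pattern, entities = recognize_entities(sentence)
    marked = CORPUS.get(pattern)
    if marked is None:
        return [sentence]  # no matching corpus information: leave the input undivided
    parts = [part.strip() for part in marked.split(BOUNDARY)]
    remaining = list(entities)
    restored = []
    for part in parts:
        # Put the original entity name words back in place of their categories.
        for category, word in list(remaining):
            if category in part:
                part = part.replace(category, word, 1)
                remaining.remove((category, word))
        restored.append(part)
    return restored


print(split_sentence("record Harry Potter and turn on KBS"))
print(split_sentence("record Titanic and turn on MBC"))
```

Both example inputs reduce to the same category-level pattern, so a single corpus entry serves sentences that differ only in their entity name words.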

The natural language processing apparatus of claim 2,
Wherein the natural language processing unit generates the plurality of short sentences in the same language as the compound or complex sentence and provides the generated short sentences to the user device.

The natural language processing apparatus of claim 2,
Wherein the natural language processing unit translates the plurality of short sentences into a language different from that of the compound or complex sentence, generates a translated sentence including a conjunction, and provides the translated sentence to the user device.

The natural language processing apparatus of claim 2,
Wherein the natural language processing unit determines the input to be the compound sentence when a plurality of short sentences, connected without a conjunction, are provided consecutively within a predetermined time interval.
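
One way to picture the time-interval condition above is to group consecutive utterances whose arrival times are close together. The sketch below is only an illustration: the function name `group_utterances`, the `(timestamp, text)` input format, and the 2.0-second default threshold are assumptions, since the claim specifies only a predetermined time interval.

```python
def group_utterances(utterances, max_gap=2.0):
    """Group (timestamp_seconds, text) pairs into candidate compound inputs.

    A short sentence arriving within max_gap seconds of the previous one
    is appended to the same group; otherwise it starts a new group.
    """
    groups = []
    last_time = None
    for timestamp, text in utterances:
        if last_time is not None and timestamp - last_time <= max_gap:
            groups[-1].append(text)
        else:
            groups.append([text])
        last_time = timestamp
    return groups


stream = [(0.0, "record Harry Potter"), (1.5, "turn on KBS"), (10.0, "volume up")]
print(group_utterances(stream))
```

A group holding more than one short sentence would then be handled as a single compound sentence, while a group of one is handled as an ordinary short sentence.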

The natural language processing apparatus of claim 2,
Wherein the natural language processing unit generates a short sentence based on the corpus information related to the entity name category to which the entity name word in the input short sentence or similar short sentence belongs.

The natural language processing apparatus of claim 2,
Wherein the natural language processing unit acquires the same corpus information when the entity name categories of a first compound sentence and a second compound sentence, which include mutually different entity name words, coincide with each other.

The natural language processing apparatus of claim 2,
Wherein the identifier includes a symbol or bit information.
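
The two identifier forms named above can be illustrated side by side. The sketch below assumes, purely for illustration, that a symbol identifier is a marker token `<B>` placed between sentences and that bit information is a per-token flag whose value 1 marks the first token of a new sentence; neither encoding is specified by the source, and the function names are hypothetical.

```python
BOUNDARY = "<B>"  # assumed symbol form of the boundary identifier


def symbols_to_bits(marked_tokens):
    """Convert a token list containing boundary symbols into (tokens, bits).

    bits[i] is 1 when tokens[i] starts a new sentence after a boundary, else 0.
    """
    tokens, bits = [], []
    start_next = False
    for tok in marked_tokens:
        if tok == BOUNDARY:
            start_next = True
            continue
        tokens.append(tok)
        bits.append(1 if start_next else 0)
        start_next = False
    return tokens, bits


def split_by_bits(tokens, bits):
    """Divide tokens into sentences wherever the boundary bit is set."""
    sentences = []
    for tok, bit in zip(tokens, bits):
        if bit or not sentences:
            sentences.append([tok])
        else:
            sentences[-1].append(tok)
    return [" ".join(sentence) for sentence in sentences]


tokens, bits = symbols_to_bits("record @movie <B> turn on @channel".split())
print(tokens, bits)
print(split_by_bits(tokens, bits))
```

Either representation carries the same boundary positions, so the choice between them is a storage and lookup detail rather than a change in what the corpus information expresses.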

A natural language processing method comprising:
Storing corpus information including an entity name category to which an entity name word in a sentence belongs and an identifier for distinguishing a boundary between sentences;
Receiving a compound or complex sentence input to a user device; and
Generating, when the compound or complex sentence is input, a plurality of short sentences corresponding to the compound or complex sentence using the identifier of the stored corpus information, and providing the generated plurality of short sentences to the user device.

The natural language processing method of claim 12,
Wherein the providing to the user device comprises:
Determining a category of the entity name word in the received compound or complex sentence;
Extracting, from a storage unit, the corpus information including the determined entity name category; and
Generating the plurality of short sentences based on the extracted corpus information.

The natural language processing method of claim 12,
Wherein the providing to the user device comprises:
Determining a category of the entity name word in the compound or complex sentence;
Changing the entity name word in the compound or complex sentence into the determined entity name category;
Acquiring the corpus information related to the changed entity name category; and
Generating the plurality of short sentences based on the acquired corpus information.

The natural language processing method of claim 14,
Wherein the generating of the plurality of short sentences comprises:
Dividing the compound or complex sentence based on the identifier; and
Restoring the divided compound or complex sentence into the plurality of short sentences.

The natural language processing method of claim 12,
Wherein the providing to the user device comprises:
Generating the plurality of short sentences in the same language as the compound or complex sentence, and providing the short sentences to the user device.

The natural language processing method of claim 12,
Wherein the providing to the user device comprises:
Determining the input to be the compound sentence when a plurality of short sentences, connected without a conjunction, are provided consecutively within a predetermined time interval.

The natural language processing method of claim 12,
Wherein the providing to the user device comprises:
Acquiring the same corpus information when the entity name categories of a first compound sentence and a second compound sentence, which include mutually different entity name words, coincide with each other.

The natural language processing method of claim 12,
Wherein the identifier includes a symbol or bit information.

A computer-readable recording medium containing a program for executing a natural language processing method,
The natural language processing method comprising:
Storing corpus information including an entity name category to which an entity name word in a sentence belongs and an identifier for distinguishing a boundary between sentences;
Receiving a compound or complex sentence input to a user device; and
Generating, when the compound or complex sentence is input, a plurality of short sentences corresponding to the compound or complex sentence using the identifier of the stored corpus information, and providing the generated plurality of short sentences to the user device.