KR101931714B1

KR101931714B1 - System and method for extracting named entities using a similar document recommendation device

Info

Publication number: KR101931714B1
Application number: KR1020160174321A
Authority: KR
Inventors: 이재안; 윤도현; 고준호; 장준환; 김현태
Original assignee: 주식회사 와이즈넛
Priority date: 2016-12-20
Filing date: 2016-12-20
Publication date: 2018-12-26
Also published as: KR20180072007A

Abstract

The present invention is a named entity recognition system that extracts named entities from a document using a similar document recommendation device. The system comprises an input unit that receives an analysis target document from which named entities are to be extracted; a similar document recommendation device that extracts similar documents resembling the analysis target document; and a named entity extraction unit that compares the analysis target document with the similar documents to detect overlapping character strings and extracts named entities from among the overlapping strings. The similar document recommendation device comprises a document analysis unit that separates the words contained in the analysis target document or in a plurality of reference documents into morphemes; a document vectorization unit that re-expresses the analysis target document or the reference documents as vectors using the morphemes produced by the document analysis unit; a database that stores the vector representations of the reference documents; and a comparison unit that compares the vector of the analysis target document with the vectors of the reference documents and extracts, from among the reference documents, the documents similar to the analysis target document.

Description

System and method for recognizing named entities in a document using a similar document recommendation device

The present invention relates to a method for extracting named entities using language processing technology.

Information that a company provides about its products and services is written from the company's point of view. Recently, however, customer-centered experience information has been produced explosively through various social network services (SNS). Companies therefore analyze the information generated through social network services quickly and accurately, and pay attention to the customer voices reflected in it.

Named entity recognition is known as a technique for accurately analyzing customer-centered experience information. A named entity is a word, or a set of words, composed of letters and digits, such as the name of a person, an organization, or a technology, or a product name such as a serial number. One prior technique for named entity recognition is Korean Patent Laid-Open Publication No. 10-2009-0124980, 'Named entity dictionary construction system and construction method'. However, that prior art recognizes named entities statistically after machine learning on document data through natural language processing, so its training and performance tuning take considerable time and effort.

The inventors of the present invention completed the invention as a result of research and effort to solve the problems described above.

The inventors propose a system that extracts named entities effectively using a similar document recommendation device.

Other objects of the present invention that are not stated explicitly will be further considered to the extent that they can be readily inferred from the following detailed description and its effects.

To achieve this object, a first aspect of the present invention provides a named entity recognition system that extracts named entities from a document using a similar document recommendation device, the system comprising: an input unit that receives an analysis target document from which named entities are to be extracted;

a similar document recommendation device that extracts similar documents resembling the analysis target document; and

a named entity extraction unit that compares the analysis target document with the similar documents to detect overlapping character strings and extracts named entities from among the overlapping strings,

wherein the similar document recommendation device comprises:

a document analysis unit that separates the words contained in the analysis target document or in a plurality of reference documents into morphemes;

a document vectorization unit that re-expresses the analysis target document or the reference documents as vectors using the morphemes produced by the document analysis unit;

a database that stores the vector representations of the reference documents; and

a comparison unit that compares the vector of the analysis target document with the vectors of the reference documents and extracts, from among the reference documents, the documents similar to the analysis target document.

In a preferred embodiment, the document analysis unit preferably comprises an analysis module that performs morphological analysis of a document;

a tagging module that tags the parts of speech of the words in the document; and

a filter module that removes parts of speech other than content morphemes, which include nouns, verbs, adjectives, and adverbs.

In a preferred embodiment, the document vectorization unit preferably comprises a substitution module that converts the morphemes received from the document analysis unit into numeric vectors using a word embedding technique; and

a representation module that expresses the whole document as a vector using the vector values of the morphemes received from the substitution module.

A second aspect of the present invention provides a named entity recognition method that extracts named entities from a document using a similar document recommendation device, the method comprising: a step in which the similar document recommendation device analyzes a plurality of reference documents and re-expresses them as vectors;

a step in which the similar document recommendation device analyzes the analysis target document and re-expresses it as a vector;

a step in which the similar document recommendation device compares the vector of the analysis target document with the vectors of the reference documents stored in the database and extracts, from among the reference documents, similar documents resembling the analysis target document;

a step in which the named entity recognition system compares the analysis target document with the similar documents to detect overlapping character strings; and

a step in which the named entity recognition system determines the extent of a named entity among the overlapping strings and extracts the named entity.

The present invention can extract named entities efficiently using a similar document recommendation device. Because no separate dictionary construction procedure is needed, it saves time and cost.

In addition, by comparing the similar documents obtained from the similar document recommendation device with the analysis target document, the present invention readily handles language variation such as domain-dependent word ambiguity, abbreviations, and neologisms. That is, it responds to language change more flexibly than techniques that extract named entities from dictionaries and rules defined in advance.

In addition, because the similar document recommendation device selects the similar documents to compare against, there is no need to build separate sets of similar documents per domain or category, which is efficient in time and cost.

In addition, because similar documents are searched for within a database built in advance, data can be processed quickly in real time.

Effects not explicitly mentioned here, but described in the following specification as expected from the technical features of the present invention, together with their provisional effects, are to be treated as if described in this specification.

FIG. 1 shows an embodiment in which the named entity recognition system of the present invention cooperates with a similar document recommendation system over a network.
FIG. 2 shows an embodiment in which the named entity recognition system of the present invention includes a similar document recommendation device.
FIG. 3 shows a preferred embodiment of the similar document recommendation device of the present invention.
FIG. 4 shows a preferred embodiment of the named entity extraction unit of the present invention.
FIG. 5 shows a preferred embodiment of determining named entities in the present invention.
FIGS. 6 and 7 show a preferred embodiment of the string index structure.
The accompanying drawings are provided by way of reference to aid understanding of the technical idea of the present invention, and they do not limit the scope of the invention.

In describing the present invention, detailed descriptions of related known functions are omitted where they are obvious to those skilled in the art and would unnecessarily obscure the gist of the invention.

FIG. 1 shows an embodiment in which the named entity recognition system of the present invention cooperates with a similar document recommendation system over a network.

As FIG. 1 shows, the named entity recognition system 100 of the present invention can cooperate with a similar document recommendation system 200 over a network. That is, when an analysis target document requiring named entity recognition is obtained from a user terminal 10, the system sends the document to the external similar document recommendation system 200 and receives recommendations of documents similar to it. For this purpose, the named entity recognition system 100 may include a module for wired or wireless communication with the similar document recommendation system 200.

In another embodiment, the named entity recognition system 100 may be designed to include the similar document recommendation system. When the similar document recommendation system is contained inside the named entity recognition system in this way, it is called a similar document recommendation device, to distinguish it from one that cooperates externally over a network. The description below takes as its example a named entity recognition system that includes a similar document recommendation device.

FIG. 2 shows an embodiment in which the named entity recognition system of the present invention includes a similar document recommendation device.

As FIG. 2 shows, the named entity recognition system 100 includes an input unit 110, a similar document recommendation device 120, and a named entity extraction unit 130.

The input unit 110 receives, from the user terminal 10, an analysis target document from which named entities are to be extracted.

The similar document recommendation device 120 receives the analysis target document from the input unit 110 and extracts similar documents resembling it. The extracted similar documents are passed to the named entity extraction unit 130.

The named entity extraction unit 130 compares the analysis target document with the similar documents to detect overlapping character strings, and extracts named entities from among the overlapping strings.

FIG. 3 shows a preferred embodiment of the similar document recommendation device of the present invention.

As FIG. 3 shows, the similar document recommendation device 120 includes a document analysis unit 121, a document vectorization unit 123, a database 125, and a comparison unit 127.

The document analysis unit 121 separates the words contained in the analysis target document or in a plurality of reference documents into morphemes. To this end, in a preferred embodiment the document analysis unit may include an analysis module that performs morphological analysis of a document, a tagging module that tags the parts of speech of the words in the document, and a filter module that removes parts of speech other than content morphemes, which include nouns, verbs, adjectives, and adverbs.
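
The filter module can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the tag names (`NOUN`, `VERB`, `ADJ`, `ADV`, `PARTICLE`, `ENDING`), the function `filter_content_morphemes`, and the hand-tagged input are all hypothetical, because the text names neither a morphological analyzer nor a tag set.

```python
# Hypothetical tag names; a real morphological analyzer defines its own tag set.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def filter_content_morphemes(tagged):
    """Keep only morphemes whose part-of-speech tag marks a content word."""
    return [morpheme for morpheme, tag in tagged if tag in CONTENT_TAGS]


# Hand-tagged stand-in for the output of the analysis and tagging modules.
tagged_sentence = [
    ("시리아", "NOUN"),
    ("에서", "PARTICLE"),
    ("내전", "NOUN"),
    ("이", "PARTICLE"),
    ("계속되", "VERB"),
    ("ㄴ다", "ENDING"),
]
content = filter_content_morphemes(tagged_sentence)
```

Only the two nouns and the verb stem survive the filter; particles and endings are dropped before vectorization.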

The document vectorization unit 123 re-expresses the analysis target document or the reference documents as vectors using the morphemes produced by the document analysis unit. To this end, in a preferred embodiment the document vectorization unit may include a substitution module that converts the morphemes received from the document analysis unit into numeric vectors using a word embedding technique, and a representation module that expresses the whole document as a vector using the vector values of the morphemes received from the substitution module.
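
One simple way to realize the substitution and representation modules is sketched below. The text does not say how morpheme vectors are combined into a document vector, so averaging is an assumption made here for illustration; the embedding table `toy_embeddings` and the function `document_vector` are likewise hypothetical.

```python
def document_vector(morphemes, embeddings):
    """Average the embedding vectors of the morphemes that have an embedding.

    Averaging is an assumed combination rule; the source only says the
    document is expressed as a vector from the morpheme vector values.
    """
    vectors = [embeddings[m] for m in morphemes if m in embeddings]
    if not vectors:
        return []
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy two-dimensional embedding table (substitution module stand-in).
toy_embeddings = {"시리아": [1.0, 0.0], "내전": [0.0, 1.0]}
doc_vec = document_vector(["시리아", "내전", "미등록"], toy_embeddings)
```

Morphemes without an embedding, such as the third one here, are skipped rather than mapped to a zero vector; that is a design choice of this sketch.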

The database 125 stores the vector representations of the plurality of reference documents.

The comparison unit 127 compares the vector of the analysis target document with the vectors of the reference documents and extracts, from among the reference documents, the documents similar to the analysis target document.
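
The comparison step might look like the following sketch. The text does not name a similarity measure, so cosine similarity and a top-k cutoff are assumptions; `cosine`, `most_similar`, and the toy vectors are hypothetical names and data.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(target_vec, reference_vecs, k=2):
    """Return the ids of the k reference documents closest to the target."""
    ranked = sorted(
        reference_vecs,
        key=lambda doc_id: cosine(target_vec, reference_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Stand-in for the vectors stored in the database.
reference_vecs = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.7, 0.7],
    "doc_c": [0.0, 1.0],
}
top = most_similar([0.9, 0.1], reference_vecs, k=2)
```

Because the reference vectors are computed ahead of time, only the target document has to be vectorized at query time.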

FIG. 4 shows a preferred embodiment of the named entity extraction unit of the present invention.

As FIG. 4 shows, the named entity extraction unit 130 may include a string expansion module 131, a string search module 133, a string index module 135, and a named entity determination module 137.

The string expansion module 131 designates character strings by expanding them sequentially from the beginning of the analysis target document. A string is expanded one character at a time from its start position. The maximum string length can be specified in advance. A string may contain spaces.

For example, suppose the analysis target document begins with the following sentence.

Example sentence: '5년동안 내전에 시달리는' ('suffering from civil war for five years')

The string expansion module 131 expands the string one character at a time, starting from the first character of the example sentence, '5'. Expanded by one character the string becomes '5년' ('5 years'), and expanded by two characters it becomes '5년동'. The module keeps expanding until the string no longer appears in the similar documents. That is, expansion continues as long as the same string exists in both the analysis target document and the similar documents, and stops when it does not. For example, if the string '5' is in the similar documents, the string is expanded. If the next string, '5년', is also in the similar documents, it is expanded again. If the string '5년동' is not in the similar documents, expansion stops.

When expansion stops, string expansion restarts from '내', the first character of the next word. String expansion may also be stopped on the basis of sentence structure or punctuation such as a period. Moreover, rather than expanding through to the next word, the maximum extent of expansion may be set in units of morphemes.
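
The expansion rule can be sketched as follows, using the example sentence from the text. The function `expand_from`, the default `max_len`, and the two similar documents are assumptions made for illustration; stopping at punctuation or at morpheme boundaries is not modeled.

```python
def expand_from(text, start, similar_docs, max_len=20):
    """Return every prefix of text[start:] found in at least one similar document.

    Expansion grows the string one character at a time and stops at the
    first string that no similar document contains.
    """
    shared = []
    for end in range(start + 1, min(len(text), start + max_len) + 1):
        candidate = text[start:end]
        if not any(candidate in doc for doc in similar_docs):
            break
        shared.append(candidate)
    return shared


# Hypothetical similar documents; the sentence is the example from the text.
similar_docs = ["5년 만에 휴전", "내전이 이어진 시리아"]
text = "5년동안 내전에 시달리는"

shared_first = expand_from(text, 0, similar_docs)
next_start = text.index(" ") + 1  # first character of the next word, '내'
shared_next = expand_from(text, next_start, similar_docs)
```

With these inputs the first expansion stops at '5년동' and the second at '내전에', mirroring the walk-through above.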

The string search module 133 checks whether the string designated by the string expansion module 131 is present in the similar documents. For example, if the designated string is '5', the module searches the similar documents for '5'. If it is found, the module then searches for '5년', the string as expanded by the string expansion module 131. When a string contains spaces, the string search module 133 may remove the spaces before returning the search result.
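
A space-insensitive search could be sketched like this. The text only says that spaces in the string may be removed; stripping whitespace from the document as well is an assumption of this sketch, and `occurs_in` is a hypothetical name.

```python
def occurs_in(candidate, document, ignore_spaces=True):
    """Report whether candidate occurs in document, optionally ignoring whitespace."""
    if ignore_spaces:
        # Assumption: whitespace is removed on both sides before matching.
        candidate = "".join(candidate.split())
        document = "".join(document.split())
    return bool(candidate) and candidate in document


found_loose = occurs_in("시리아 인권관측소", "시리아인권관측소는 밝혔다")
found_strict = occurs_in("시리아 인권관측소", "시리아인권관측소는 밝혔다", ignore_spaces=False)
```

Ignoring spaces lets a spaced variant of a name match its unspaced form, which matters for Korean text where spacing is often inconsistent.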

The string index module 135 stores those strings designated by the string expansion module 131 that the string search module 133 finds in the similar documents. In other words, it stores the strings common to the analysis target document and the similar documents. The string index module 135 stores each character of a string in a hierarchical tree structure.

For example, suppose the common strings found through the string expansion module 131 and the string search module 133 are '5년' ('5 years'), '내전' ('civil war'), '시리아' ('Syria'), '시리아에서' ('in Syria'), and '시리아인권관측소' ('Syrian Observatory for Human Rights'). Suppose '5년' appeared once in one similar document, '시리아' appeared 21 times across six similar documents, '시리아에서' appeared three times across three similar documents, and '시리아인권관측소' appeared twice across three similar documents. Suppose further that in the analysis target document '5년' appeared once, '시리아' three times, '시리아에서' twice, and '시리아인권관측소' once. In this case the string index module 135 can store the string index as a hierarchical tree structure, as in FIG. 6. That is, the module splits the strings common to the analysis target document and the similar documents into single characters, arranges them in a hierarchical tree, and marks each string with its frequencies. The frequencies can be written in the following form.

[string, TF/DF/QTF]

TF is the number of times the string occurs in the similar documents, DF is the number of similar documents in which the string occurs, and QTF is the number of times the string occurs in the analysis target document.
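
A character-level tree of this kind is commonly implemented as a trie. The sketch below is one possible reading of the index, not the structure drawn in FIG. 6: the class `Node`, the functions `build_index` and `lookup`, and the toy documents are all assumptions, and the counts use Python's non-overlapping `str.count`.

```python
class Node:
    """One character position in the tree; stats is set where a shared string ends."""

    def __init__(self):
        self.children = {}
        self.stats = None  # (TF, DF, QTF)


def build_index(shared_strings, target_doc, similar_docs):
    """Insert each shared string character by character and attach TF/DF/QTF."""
    root = Node()
    for s in shared_strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, Node())
        tf = sum(doc.count(s) for doc in similar_docs)    # occurrences in similar docs
        df = sum(1 for doc in similar_docs if s in doc)   # similar docs containing s
        qtf = target_doc.count(s)                         # occurrences in the target
        node.stats = (tf, df, qtf)
    return root


def lookup(root, s):
    """Return (TF, DF, QTF) for s, or None if s was not indexed."""
    node = root
    for ch in s:
        node = node.children.get(ch)
        if node is None:
            return None
    return node.stats


target_doc = "시리아에서 내전. 시리아 정부"
similar_docs = ["시리아 내전", "시리아에서 시리아로"]
root = build_index(["시리아", "시리아에서", "내전"], target_doc, similar_docs)
```

Strings that share a prefix, such as '시리아' and '시리아에서', share the same path from the root, so the longer string only adds the nodes for '에' and '서'.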

The named entity determination module 137 processes the index result generated by the string index module 135 through a decision function and makes the final determination of named entities. The decision function computes a named entity weight for each string stored in the index result. The weight may be the frequency in the analysis target document. In other embodiments, however, the weight may also use the frequency in the similar documents, the number of similar documents, and the string length, in addition to the frequency in the analysis target document.

In a preferred embodiment, the decision function is as follows.

According to this decision function, the weight is computed by adding the frequency of the string in the analysis target document (QTF) to the value of an expression defined from the frequency in the similar documents (TF) and the number of similar documents in which the string occurs (DF), and then multiplying the sum by the logarithm of the string length (depth). Under the decision function, the weight rises as the string's frequency in the analysis target document and in the similar documents rises, and as DF, the number of similar documents in which it occurs, falls. The weight also rises as the string gets longer.
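
The equation itself is not reproduced in this text, so the following sketch assumes one concrete form that agrees with the prose: weight = (QTF + TF / DF) × ln(depth). The term TF / DF is an assumption standing in for the unspecified expression over TF and DF, the natural logarithm is likewise assumed, and `weight` is a hypothetical name. The statistics are the example values given earlier for three of the strings.

```python
import math


def weight(tf, df, qtf, depth):
    """Assumed decision function: (QTF + TF / DF) * ln(depth).

    TF / DF is a placeholder for the expression the source leaves unspecified;
    it rises with TF and falls with DF, as the description requires.
    """
    return (qtf + tf / df) * math.log(depth)


# (TF, DF, QTF) from the worked example in the text.
stats = {
    "5년": (1, 1, 1),
    "시리아": (21, 6, 3),
    "시리아에서": (3, 3, 2),
}
weights = {s: weight(tf, df, qtf, len(s)) for s, (tf, df, qtf) in stats.items()}
ranked = sorted(weights, key=weights.get, reverse=True)
```

Under this assumed form a one-character string always gets weight zero, because ln(1) = 0. The values produced here are not expected to reproduce the weights shown in FIG. 7.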

The result of computing the weights in this way is shown in FIG. 7. The weight is the number shown after each string.

The named entity determination module 137 computes a weight for each string as in FIG. 7 and extracts named entity candidates in descending order of weight. For example, the highly weighted '시리아+인권관측소', '시리아+에서', '시리아', and '5년' may be presented as named entity candidates. In post-processing, the candidates may include strings without a branch, such as '5년', and strings with a branch, such as '시리아+인권관측소'. When the last segment after the '+' sign, which marks the split, matches a particle (조사) result, that segment may be omitted. The named entities determined by the named entity determination module 137 may be provided to the user terminal through an output unit. The provided result may include, besides the named entities, the similar documents received from the similar document recommendation device. The weights of the named entities may also be provided, and the entities may be sorted in descending order of weight.

FIG. 5 shows a preferred embodiment of determining named entities in the present invention.

When an analysis target document is received from the user terminal, the similar document recommendation device receives it and recommends documents similar to it. To do so, the device may perform a step of analyzing a plurality of reference documents and re-expressing them as vectors, a step of analyzing the analysis target document and re-expressing it as a vector, and a step of comparing the vector of the analysis target document with the vectors of the reference documents stored in the database and extracting, from among the reference documents, similar documents resembling the analysis target document.

The named entity recognition system obtains the similar documents (S1100) and compares the analysis target document with them to detect overlapping character strings. The system designates the initial string (S1200) and searches for whether the designated string exists (S1300).

If the designated string exists, it is stored in the string index (S1400), the string is expanded by one character (S1410), and the search for whether the expanded string exists is repeated (S1300).

If the designated string does not exist, the system determines whether the string can be moved to the next word (S1500). If it can, the string is moved to the next word (S1510); if it cannot, the named entities are finally determined (S1600).
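
The loop from S1200 to S1510 can be sketched as a single function. This is an illustrative reading of the flow, with the step labels in comments: the function `shared_string_index`, the use of a `Counter` as the index, whitespace as the word boundary, and the sample documents are assumptions, and the ranking of S1600 is left out.

```python
from collections import Counter


def shared_string_index(target, similar_docs, max_len=20):
    """Walk the target word by word and index strings shared with similar documents."""
    index = Counter()
    # Offsets where a whitespace-delimited word begins (candidates for S1200/S1510).
    starts = [
        i
        for i, ch in enumerate(target)
        if not ch.isspace() and (i == 0 or target[i - 1].isspace())
    ]
    for start in starts:  # S1500/S1510: move on to the next word while one remains
        end = start + 1
        while end <= len(target) and end - start <= max_len:
            candidate = target[start:end]
            if not any(candidate in doc for doc in similar_docs):  # S1300
                break
            index[candidate] += 1  # S1400: store the shared string
            end += 1  # S1410: expand by one character
    return index  # S1600 would rank these candidates


similar_docs = ["5년 만에 휴전", "내전이 이어진 시리아"]
index = shared_string_index("5년동안 내전에 시달리는", similar_docs)
```

The returned counter records how many times each shared string was reached from a word start in the target document; a full implementation would feed it, together with TF and DF, into the decision function.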

For reference, the method of the present invention may be implemented as program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and constructed for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

The scope of protection of the present invention is not limited to the description and wording of the embodiments explicitly set out above. It is noted once more that the scope of protection cannot be limited by changes or substitutions that are obvious in the technical field to which the invention belongs.

Claims

A named entity recognition system for extracting named entities from a document using a similar document recommendation device, the system comprising:
an input unit that receives an analysis target document from which named entities are to be extracted;
a similar document recommendation device that extracts similar documents resembling the analysis target document; and
a named entity extraction unit that compares the analysis target document with the similar documents to detect overlapping character strings and extracts named entities from among the overlapping strings,
wherein the similar document recommendation device comprises:
a document analysis unit that separates the words contained in the analysis target document or in a plurality of reference documents into morphemes;
a document vectorization unit that re-expresses the analysis target document or the reference documents as vectors using the morphemes produced by the document analysis unit;
a database that stores the vector representations of the reference documents; and
a comparison unit that compares the vector of the analysis target document with the vectors of the reference documents and extracts, from among the reference documents, the documents similar to the analysis target document,
and wherein the named entity extraction unit extracts named entities according to a weight that uses the frequency of a string in the analysis target document, the frequency of the string in the similar documents, the number of similar documents in which the string occurs, and the length of the string, the weight being proportional to the frequency of the string in the analysis target document, the frequency of the string in the similar documents, and the length of the string, and inversely proportional to the number of similar documents in which the string occurs.

The system according to claim 1,
wherein the document analysis unit comprises:
an analysis module that performs morphological analysis of a document;
a tagging module that tags the parts of speech of the words in the document; and
a filter module that removes parts of speech other than content morphemes, which include nouns, verbs, adjectives, and adverbs.

The system according to claim 1,
wherein the document vectorization unit comprises:
a substitution module that converts the morphemes received from the document analysis unit into numeric vectors using a word embedding technique; and
a representation module that expresses the whole document as a vector using the vector values of the morphemes received from the substitution module.

A named entity recognition method for extracting named entities from a document using a similar document recommendation device, the method comprising:
a step in which the similar document recommendation device analyzes a plurality of reference documents and re-expresses them as vectors;
a step in which the similar document recommendation device analyzes the analysis target document and re-expresses it as a vector;
a step in which the similar document recommendation device compares the vector of the analysis target document with the vectors of the reference documents stored in the database and extracts, from among the reference documents, similar documents resembling the analysis target document;
a step in which the named entity recognition system compares the analysis target document with the similar documents to detect overlapping character strings; and
a step in which the named entity recognition system determines the extent of a named entity among the overlapping strings and extracts the named entity,
wherein the step of extracting the named entity extracts the named entity according to a weight that uses the frequency of a string in the analysis target document, the frequency of the string in the similar documents, the number of similar documents in which the string occurs, and the length of the string, the weight being proportional to the frequency of the string in the analysis target document, the frequency of the string in the similar documents, and the length of the string, and inversely proportional to the number of similar documents in which the string occurs.