KR102540947B1

KR102540947B1 - A document information processing system through automatic construction of a thesaurus and a document information processing method

Info

Publication number: KR102540947B1
Application number: KR1020220098178A
Authority: KR
Inventors: 심지현; 고형석; 곽효승; 이홍재
Original assignee: (주)유알피
Priority date: 2022-08-05
Filing date: 2022-08-05
Publication date: 2023-06-07

Abstract

A document information processing system according to an embodiment of the present invention implements a process of processing documents for a search database. The document information processing system includes: a processing module that acquires processed information, consisting of at least one sentence, obtained by processing a target document; and a thesaurus module that automatically generates a thesaurus for a reference word, which is a word extracted from the processed information. The thesaurus module can automatically generate the thesaurus by calculating the degree of similarity between words using unsupervised learning.

Description

Document information processing system and document information processing method through automatic construction of a thesaurus

The present invention relates to a document information processing system and a document information processing method featuring automatic construction of a thesaurus.

With the transition to the digital age, more and more analog material is being converted to digital form. This shift is changing the environment of households, businesses, and public institutions. Recently the change has been most pronounced in the office: under paperless policies, all office documents are digitized, and internal payment and management are handled through digital documents.

To manage and search digital documents effectively, it is important to build an effective database. This requires a process that processes documents, extracts their keywords and summaries, and sorts them, which in turn requires document preprocessing. In addition, effective definition and management of synonyms is very important for classifying, comparing, and judging documents. Conventionally, however, the thesaurus was obtained from an external server, so it could not effectively reflect the specific intent and purpose of the document information processing system.

The present invention is intended to solve the above problems and to provide a document information processing system and a document information processing method through automatic construction of a thesaurus, in order to build an efficient search database.

However, the technical problems to be solved by the present embodiment are not limited to those described above, and other technical problems may exist.

A document information processing system according to an embodiment of the present invention implements a process of processing documents for a search database, and includes: a processing module that acquires processed information, consisting of at least one sentence, obtained by processing a target document; and a thesaurus module that automatically generates a thesaurus for a reference word, which is a word extracted from the processed information. The thesaurus module may automatically generate the thesaurus by calculating the degree of similarity between words using unsupervised learning.

In addition, the thesaurus module may include: a preprocessing unit that preprocesses the processed information using at least one of splitting it into sentences, replacing Chinese characters with Hangul, removing isolated graphemes, and removing whitespace before and after sentences; a word similarity calculation unit that calculates the degree of similarity between words for the reference word generated by preprocessing the processed information; and a dictionary construction unit that adds the reference word to the thesaurus based on the degree of similarity between words calculated by the word similarity calculation unit.

In addition, the word similarity calculation unit may convert the reference word into vector form so that the degree of similarity between words is reflected.

In addition, the thesaurus module may further include a word similarity learning unit that generates a word similarity model for converting the reference word into Word2Vec form in consideration of the degree of similarity between words, and the word similarity learning unit may generate the word similarity model through unsupervised learning.

In addition, the dictionary construction unit may load all learned words and generate the thesaurus based on their degree of similarity to the reference word.

In addition, the dictionary construction unit may load the entire list of words learned in an encoded state, decode it, and generate the thesaurus based on similarity information with respect to the reference word.

In addition, the system may further include: a summary extraction module that extracts a summary from the processed information; and a keyword extraction module that extracts important keywords from the processed information.

In addition, the summary extraction module may include a sentence importance determination unit that calculates an importance for each sentence of the processed information, a summary sentence extraction unit that extracts a summary from the processed information in consideration of sentence importance, and a sentence model generation unit that generates an importance determination model for determining the importance of the processed information within the target document. The sentence model generation unit may generate the importance determination model by machine learning.

In addition, the summary extraction module may further include a sentence model selection unit that calculates the summary accuracy of the importance determination models using pre-verification sentences and selects one of the plurality of importance determination models based on the summary accuracy. The pre-verification sentences are classified by document format, and the sentence model selection unit may select the pre-verification sentences according to the document format.

A document information processing method according to an embodiment of the present invention is implemented by a document information processing system and includes: an acquisition step in which a processing module processes a target document and processed information consisting of at least one sentence is acquired; and a step in which a thesaurus is automatically generated for a reference word, which is a word extracted from the processed information. The thesaurus may be automatically generated by a thesaurus module that calculates the degree of similarity between words using unsupervised learning.

The document information processing system and document information processing method through automatic construction of a thesaurus according to the present invention can build a thesaurus effectively.

In addition, document management efficiency can be maximized.

In addition, the management efficiency of the database can be maximized.

In addition, convenience of use can be maximized.

However, the effects of the present invention are not limited to those described above, and effects not mentioned will be clearly understood by those skilled in the art to which the present invention pertains from this specification and the accompanying drawings.

FIG. 1 is a relationship diagram of a document information processing system according to an embodiment of the present invention.
FIG. 2 is a configuration diagram of a document information processing system according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining a process implemented by the thesaurus module included in a document information processing system according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining a process implemented by the summary extraction module included in a document information processing system according to an embodiment of the present invention.
FIG. 5 is a diagram for explaining a process implemented by the keyword extraction module included in a document information processing system according to an embodiment of the present invention.

Hereinafter, specific embodiments of the present invention will be described in detail with reference to the drawings. However, the spirit of the present invention is not limited to the embodiments presented. Those skilled in the art who understand the spirit of the present invention may readily propose other regressive inventions, or other embodiments falling within the scope of the spirit of the present invention, by adding, changing, or deleting components within the scope of the same spirit, and these are also considered to fall within the scope of the spirit of the present invention.

FIG. 1 is a relationship diagram of a document information processing system according to an embodiment of the present invention, and FIG. 2 is a configuration diagram of a document information processing system according to an embodiment of the present invention.

Referring to FIGS. 1 and 2, the document information processing system 100 according to an embodiment of the present invention may collect document data (hereinafter referred to as documents) from an external server 200.

As an example, the external server 200 may be a public institution server, a corporate server, and/or a personal server.

However, the type of the external server 200 is not limited thereto and may be variously modified at a level obvious to those skilled in the art.

As an example, the external server 200 may be a server that collects documents and processes the data to produce processed information.

The server referred to in the present invention may include other components for providing its server environment. The server may be any type of device.

As an example, the server may be a digital device equipped with a processor, memory, and computing capability, such as a laptop computer, notebook computer, desktop computer, web pad, or mobile phone.

As an example, the server may be a web server. However, the type of server is not limited thereto and may be variously changed at a level obvious to those skilled in the art.

A document may be a document created or managed electronically within an institution.

As an example, documents may include internal payment documents, reports in which work progress is accumulated, and the like.

However, the type of document is not limited thereto and may be variously modified at a level obvious to those skilled in the art.

The document information processing system 100 may be connected to the external server 200 over a network so that they can exchange information.

Electronic documents may be documents containing materials such as various kinds of text. They may take various forms, such as administrative documents, reports, papers, and evaluation reports, may have file formats such as odt, pdf, ppt, pptx, xls, xlsx, doc, docx, hwp, and hwpx, and may be stored in a database.

The online network referred to in the present invention may be a core network integrated with a wired public network, a wireless mobile communication network, or the mobile Internet, or it may mean a worldwide open computer network structure that provides the TCP/IP protocol and the various services existing in its upper layers, namely HTTP (Hyper Text Transfer Protocol), HTTPS (Hyper Text Transfer Protocol Secure), Telnet, FTP (File Transfer Protocol), DNS (Domain Name System), SMTP (Simple Mail Transfer Protocol), and the like. It is not limited to these examples and broadly means a data communication network capable of transmitting and receiving data in various forms.

The document information processing system 100 according to an embodiment of the present invention implements a process of processing documents for a search database, and may include a processing module 110 that acquires processed information, consisting of at least one sentence, obtained by processing a target document, a summary extraction module 120 that extracts a summary from the processed information, and a keyword extraction module 130 that extracts important keywords from the processed information.

In addition, the system may further include a thesaurus module 140 that automatically generates a thesaurus for a reference word, which is a word extracted from the processed information.

The processing module 110 may be connected to the external server 200 so that they can exchange information, and may request and receive processed information.

As an example, the processing module 110 may be a communication module.

As an example, the communication module may include a cellular module, a WiFi module, a Bluetooth module, a GNSS module, an NFC module, an RF module, a 5G module, an LTE module, an NB-IoT module, and/or a LoRa module.

However, the modules included in the communication module are not limited thereto and may be variously modified at a level obvious to those skilled in the art.

The processed information may be information consisting of several sentences.

Here, the sentences may be a summary of a single document, or may be sentences extracted from part of it.

The thesaurus module 140 may include: a preprocessing unit 141 that preprocesses the processed information using at least one of splitting it into sentences, replacing Chinese characters with Hangul, removing isolated graphemes, and removing whitespace before and after sentences; a word similarity calculation unit 143 that calculates the degree of similarity between words for the reference word generated by preprocessing the processed information; and a dictionary construction unit 145 that adds the reference word to the thesaurus based on the degree of similarity between words calculated by the word similarity calculation unit 143.

In addition, the thesaurus module 140 may further include a tag unit 142 that attaches tags to the reference word, and a word similarity learning unit 144 that generates a word similarity model for converting the reference word into Word2Vec form in consideration of the degree of similarity between words.

The preprocessing unit 141 may split a document into individual sentences.

The preprocessing unit 141 may split the processed information into sentences and replace foreign characters, such as Chinese characters, English, and Japanese, with Hangul.

In addition, the preprocessing unit 141 may remove isolated graphemes such as 'ㅇ' and 'ㅁ' from the processed information, and may preprocess the processed information by removing whitespace before and after sentences.

However, preprocessing is not limited thereto, and other preprocessing conditions may be added where necessary.
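
The preprocessing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `preprocess`, the sentence-boundary rule, and the two-entry Hanja table are all hypothetical, and a real system would need a complete character mapping.

```python
import re

# Hypothetical, tiny Hanja-to-Hangul table used only for illustration.
HANJA_TO_HANGUL = {"文": "문", "書": "서"}

# Hangul Compatibility Jamo (U+3131..U+318E): standalone letters such as 'ㅇ' or 'ㅁ'.
LONE_JAMO = re.compile(r"[\u3131-\u318E]")

# Assumed sentence boundary: whitespace that follows '.', '!' or '?'.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def preprocess(text):
    """Split text into sentences, replace Hanja, drop lone jamo, trim whitespace."""
    sentences = []
    for raw in SENTENCE_END.split(text):
        replaced = "".join(HANJA_TO_HANGUL.get(ch, ch) for ch in raw)
        cleaned = LONE_JAMO.sub("", replaced).strip()
        if cleaned:  # skip fragments that became empty after cleaning
            sentences.append(cleaned)
    return sentences
```

Composed Hangul syllables are unaffected by the jamo pattern, because they live in a different Unicode block (U+AC00..U+D7A3).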

The tag unit 142 may generate reference words by performing morpheme segmentation on the preprocessed information.

In addition, the tag unit 142 may attach named-entity tags to the reference words.

The tag unit 142 may remove sentences with a length of 3000 or more.

This may be because sentences with a length of 3000 or more mostly originate from tables, images, and the like.

As an example, the tag unit 142 may select and use a suitable morphological analyzer such as Nori, Mecab, khaiii, or Okt.

The tag unit 142 may integer-encode the reference words to which morpheme tags and/or named-entity tags are attached.

Here, the integer index used for encoding may be stored in advance in the tag unit 142.
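
As a rough sketch of the length filter and integer encoding described above: the helper names are hypothetical, the length is assumed to be counted in characters (the text does not state the unit), and extending the index for unseen words is an assumption, since the text only says the index may be stored in advance.

```python
MAX_SENTENCE_LENGTH = 3000  # threshold taken from the text; unit assumed to be characters


def drop_long_sentences(sentences):
    """Discard sentences whose length is 3000 or more."""
    return [s for s in sentences if len(s) < MAX_SENTENCE_LENGTH]


def encode(tagged_words, index):
    """Map (word, tag) pairs to integers, adding unseen pairs to the index."""
    ids = []
    for pair in tagged_words:
        if pair not in index:
            index[pair] = len(index)
        ids.append(index[pair])
    return ids


def decode(ids, index):
    """Invert the integer encoding back to (word, tag) pairs."""
    reverse = {i: pair for pair, i in index.items()}
    return [reverse[i] for i in ids]
```

Keeping the same `index` object across runs is what lets a later step decode words that were learned in encoded form.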

The word similarity calculation unit 143 may convert the encoded reference words into vector form so that the degree of similarity between words is reflected.

As an example, the word similarity calculation unit 143 may generate Word2Vec vectors for the reference words.

The algorithm for generating Word2Vec vectors uses existing known technology, so a detailed description of it may be omitted.

Here, the word similarity calculation unit 143 may set the number of bits to at least 3.

Here, the number of bits may be set to a fixed value to allow for additional training.

The word similarity learning unit 144 may generate the word similarity model through unsupervised learning.

The word similarity learning unit 144 may load a previously generated word similarity model and retrain it with the tagged reference words as input.

The training parameters may be set identically to those of the existing model.

The unsupervised learning algorithm uses known technology, so a detailed description of it may be omitted.

The dictionary construction unit 145 may load all learned words and generate the thesaurus based on their degree of similarity to the reference word.

The dictionary construction unit 145 may load the entire list of words learned in an encoded state, decode it, and generate the thesaurus based on similarity information with respect to the reference word.

The dictionary construction unit 145 may load the entire list of words learned in an integer-encoded state.

Past reference words were learned in an integer-encoded state, and the integer encoding of the reference words after training may be stored in the tag unit 142.

The dictionary construction unit 145 may load all of the integer encodings of the words stored in the tag unit 142 and then decode them.

The dictionary construction unit 145 may decode the word list expressed in integer encoding.

The dictionary construction unit 145 may exclude, from the reference words, morphemes carrying a designated stop-morpheme tag.

As an example, for the Nori morphological analyzer, the stop-morpheme list may be 'UNKNOWN', 'JKB', 'JX', 'JKO', 'JKS', 'JKQ', 'JKG', 'JKC', 'JKV', 'JC', 'EC', 'EF', 'EP', 'ETN', 'ETM', 'XPN', 'XSN', 'XSV', 'XSA', 'MAG', 'MAJ', 'IC', 'VX', 'VCP', 'VCN', 'NR', 'NP', 'NNBC', 'NNB', 'SF', 'SE', 'SSO', 'SSC', 'SC', 'SY', 'SL', and 'SN'. However, the stop-morpheme list is not limited thereto and may be modified, at a level obvious to those skilled in the art, according to the morphological analyzer adopted.

The dictionary construction unit 145 may exclude morphemes carrying a designated stop named-entity tag.

In addition, the dictionary construction unit 145 may exclude morphemes consisting of a single character.

However, the exclusion conditions are not limited thereto, and other conditions may be added where necessary.
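
The exclusion rules above can be illustrated with a small filter. The tag set is the Nori list quoted in the text; the function name `keep_morpheme`, its parameters, and the example entity tag are hypothetical, since the text does not list the stop named-entity tags.

```python
# Stop-morpheme tags as listed in the text for the Nori analyzer.
STOP_MORPHEME_TAGS = {
    "UNKNOWN", "JKB", "JX", "JKO", "JKS", "JKQ", "JKG", "JKC", "JKV", "JC",
    "EC", "EF", "EP", "ETN", "ETM", "XPN", "XSN", "XSV", "XSA", "MAG", "MAJ",
    "IC", "VX", "VCP", "VCN", "NR", "NP", "NNBC", "NNB", "SF", "SE", "SSO",
    "SSC", "SC", "SY", "SL", "SN",
}


def keep_morpheme(surface, pos_tag, entity_tag=None, stop_entity_tags=frozenset()):
    """Return True if the morpheme survives all three exclusion rules."""
    if pos_tag in STOP_MORPHEME_TAGS:
        return False  # designated stop-morpheme tag
    if entity_tag is not None and entity_tag in stop_entity_tags:
        return False  # designated stop named-entity tag (caller-supplied set)
    if len(surface) <= 1:
        return False  # single-character morpheme
    return True
```

Passing the stop named-entity tags as a parameter keeps the sketch neutral about which entity classes a deployment chooses to exclude.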

The dictionary construction unit 145 may extract the word vectors most similar to the reference word.

As an example, the dictionary construction unit 145 may use Word2Vec to extract the 10 word vectors most similar to the reference word.

Here, the dictionary construction unit 145 may exclude from the similar-word list any word whose cosine similarity is less than 0.5.

In addition, the dictionary construction unit 145 may exclude from the similar-word list any word corresponding to a stop morpheme.

Here, the number of extracted word vectors and the cosine similarity threshold may be adjusted according to the results.
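
The candidate selection above (top 10 neighbours, cosine similarity of at least 0.5) can be sketched in plain Python. This is an illustrative stand-in for a Word2Vec library lookup, assuming the trained vectors are already available as a dictionary; the function names and the toy vectors in the usage note are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_words(reference, vectors, top_n=10, min_similarity=0.5):
    """Return up to top_n (word, similarity) pairs at or above the threshold."""
    ref_vec = vectors[reference]
    scored = [(w, cosine(ref_vec, v)) for w, v in vectors.items() if w != reference]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Take the top_n nearest words first, then drop those below the threshold.
    return [(w, s) for w, s in scored[:top_n] if s >= min_similarity]
```

For instance, with toy vectors where "서류" and "보고서" point roughly in the same direction as "문서" and "사과" is orthogonal to it, only the first two would be returned. Both `top_n` and `min_similarity` are exposed as parameters because the text says these values may be tuned.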

The dictionary construction unit 145 may update the thesaurus by arranging the words similar to the reference word into one group.

The summary extraction module 120 may include a sentence importance determination unit 121 that calculates an importance for each sentence of the processed information, and a summary sentence extraction unit 122 that extracts a summary from the processed information in consideration of sentence importance.

In addition, the summary extraction module 120 may further include a sentence model generation unit 123 that generates an importance determination model for determining the importance of the processed information within the target document, a sentence model selection unit 124 that calculates the summary accuracy of the importance determination models using pre-verification sentences and selects one of the plurality of importance determination models based on the summary accuracy, and a sentence feedback unit 125 that determines whether to update the pre-verification sentences.

In addition, the summary extraction module 120 may further include a sentence normalization unit 126 that normalizes the sentence importance determined by the sentence importance determination unit 121 to a common scale, and a sentence ranking unit 127 that ranks sentences by importance based on their importance tags.

The sentence model generation unit 123 may generate an importance determination model by machine learning.

There may be a plurality of importance determination models.

As an example, the importance determination model may be 'SBERT', 'TextRank', or the like.

However, the type of importance determination model is not limited thereto and may be variously modified at a level obvious to those skilled in the art.

Machine learning may be a concept that includes deep learning.

Here, deep learning may use the backpropagation algorithm, which updates the weights of a neural network using labeled data at the output layer, but is not limited thereto.

Here, deep neural networks and the backpropagation algorithm are conventionally known, so a detailed description of them may be omitted.

문장 모델 생성부(123)가 중요도 판단모델을 생성하기 위해서 수많은 문서, 문서를 가공한 가공 정보 및 가공 정보 내의 문장별 중요도 랭크들에 대한 데이터들이 활용될 수 있다. In order for the sentence model generation unit 123 to create an importance determination model, a number of documents, processed information obtained by processing the documents, and data on importance ranks for each sentence in the processed information may be utilized.

문장 모델 선택부(124)는 사전 검증 문장을 이용해서 문장 중요도 판단부(121)가 사용할 중요도 판단모델을 선택할 수 있다.The sentence model selection unit 124 may select an importance determination model to be used by the sentence importance determination unit 121 by using the pre-verification sentence.

사전 검증 문장은 문서 전문과 해당 문서 전문의 요약문이 하나의 군으로 이루어지며, 복수개의 군들로서 이루어질 수 있다. The pre-verification sentence consists of a whole document and a summary of the full text of the document as one group, and may be composed of a plurality of groups.

여기서, 상기 사전 검증 문장은 문서 형식에 따라 분류될 수 있다. Here, the pre-verification sentences may be classified according to document types.

일례로, 문서가 논문일 경우의 사전 검증 문장과 문서가 보고서일 경우의 사전 검증 문장이 서로 상이할 수 있다. For example, a pre-verification sentence when the document is a thesis and a pre-verification sentence when the document is a report may be different from each other.

또한, 사전 검증 문장은 문서 전문 중 발췌 문장들로 이루어진 가공 정보와 가공 정보 내의 요약문이 하나의 군으로 이루어질 수 있으며, 복수개의 군들로서 이루어질 수 있다. In addition, the pre-verification sentence may consist of processed information consisting of excerpts from the full text of the document and a summary within the processed information as one group or as a plurality of groups.

문장 모델 선택부(124)는 중요도 판단모델을 통해 사전 검증 문장으로부터 소정 기준의 중요도를 가지는 문장을 추출하고, 그 추출된 문장과 사전 검증 문장 내의 요약문과 비교하여 '요약 정확도'를 산출할 수 있다. The sentence model selection unit 124 may use an importance judgment model to extract sentences having a predetermined level of importance from the pre-verification sentences, and may compare the extracted sentences with the summary in the pre-verification sentences to calculate a 'summary accuracy'.

문장 모델 선택부(124)는 '요약 정확도'가 가장 높은 중요도 판단모델을 선택할 수 있다. The sentence model selection unit 124 may select the importance judgment model having the highest 'summary accuracy'.
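The selection step performed by the sentence model selection unit 124 can be illustrated with a minimal sketch. The source does not define how 'summary accuracy' is computed, so the metric below (the fraction of reference summary sentences that a model ranks among its top sentences) is an assumption, and the names `ScoreModel`, `Group`, `summary_accuracy`, and `select_model` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A scoring model maps the sentences of a document to one importance score each.
ScoreModel = Callable[[Sequence[str]], List[float]]
# One pre-verification group: (document sentences, reference summary sentences).
Group = Tuple[List[str], List[str]]


def summary_accuracy(model: ScoreModel, groups: Sequence[Group]) -> float:
    """Mean fraction of reference summary sentences recovered (assumed metric)."""
    total = 0.0
    for sentences, reference in groups:
        scores = model(sentences)
        k = len(reference)
        # Treat the k highest-scoring sentences as the model's summary.
        ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
        picked = {sentences[i] for i in ranked[:k]}
        total += (len(picked & set(reference)) / k) if k else 0.0
    return total / len(groups) if groups else 0.0


def select_model(models: Dict[str, ScoreModel], groups: Sequence[Group]) -> str:
    """Return the name of the model with the highest summary accuracy."""
    return max(models, key=lambda name: summary_accuracy(models[name], groups))
```

In practice the candidate models would be trained importance judgment models; any callable that returns one score per sentence fits this interface.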

문장 모델 선택부(124)는 문서 형식에 따라 상기 사전 검증 문장을 선택할 수 있다. The sentence model selection unit 124 may select the pre-verification sentence according to the document format.

문장 중요도 판단부(121)는 문장 모델 선택부(124)에서 선택된 중요도 판단모델을 이용하여, 가공 정보 내의 문장들의 중요도를 판단할 수 있다. The sentence importance determining unit 121 may determine the importance of sentences in processed information by using the importance determining model selected by the sentence model selection unit 124 .

여기서, 문장 중요도 판단부(121)는 각각의 문장별로 중요도 태그를 부착할 수 있다. Here, the sentence importance determiner 121 may attach an importance tag for each sentence.

문장 정규화부(126)는 상기 문장 중요도 판단부(121)에서 판단된 문장 중요도를 공통의 기준으로 정규화할 수 있다. The sentence normalization unit 126 may normalize the sentence importance determined by the sentence importance determining unit 121 based on a common criterion.

중요도 판단모델은 문장의 중요도 혹은 전체 문서와의 의미 유사도를 숫자로 표현하지만, 표현 기준과 범위가 상이할 수 있다. The importance judgment model expresses the importance of a sentence or the semantic similarity with the entire document as a number, but the expression standard and range may be different.

이는, 추후 문장 순위 책정 및 요약문 추출 시에 오차를 발생시킬 수 있다. This may cause errors in ranking sentences and extracting summaries later.

이를 예방하기 위해, 문장 정규화부(126)는 중요도 수치를 일정한 기준으로 정규화하여, 서로 다른 중요도 판단모델을 사용하더라도, 비교 가능한 중요도 값을 얻을 수 있도록 할 수 있다. In order to prevent this, the sentence normalization unit 126 normalizes the importance values based on a predetermined criterion so that comparable importance values can be obtained even when different importance judgment models are used.
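One way to put scores from different importance judgment models on a common scale is min-max normalization. The source only says that the values are normalized "based on a predetermined criterion", so the choice of min-max scaling to the range [0, 1] is an assumption, and the function name `min_max_normalize` is hypothetical.

```python
from typing import List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All sentences scored the same, so no sentence stands out.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]
```

After this step, a model that emits scores in [-1, 1] and a model that emits scores in [0, 100] both yield values in [0, 1], which makes the subsequent ranking comparable across models.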

문장 순위부(127)는 정규화된 문장들의 순서를 책정할 수 있다. The sentence ranking unit 127 may determine the order of normalized sentences.

요약문 추출부(122)는 순위가 높은 순서대로, 미리 정해진 개수 내로 문장들을 추출하여 요약문을 추출할 수 있다. The summary sentence extraction unit 122 may extract a summary sentence by extracting sentences within a predetermined number in order of high ranking.

즉, 요약문 추출부(122)는 중요도가 높은 문장 순으로서, 사용자가 원하는 수의 문장을 추출하여, 핵심 내용이 들어간 요약 문장으로 결정할 수 있다. That is, the summary sentence extraction unit 122 may extract a number of sentences desired by the user in order of sentences of high importance, and determine them as summary sentences containing key contents.
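The ranking and extraction performed by the sentence ranking unit 127 and the summary sentence extraction unit 122 can be sketched as follows. The function name `extract_summary` and its parameters are hypothetical; ties are kept in their original document order, which is a design choice of this sketch and is not stated in the source.

```python
from typing import List, Sequence


def extract_summary(
    sentences: Sequence[str],
    importance: Sequence[float],
    max_sentences: int,
) -> List[str]:
    """Return up to max_sentences sentences, highest importance first."""
    if len(sentences) != len(importance):
        raise ValueError("one importance value is required per sentence")
    # sorted() is stable, so equally important sentences keep document order.
    order = sorted(range(len(sentences)), key=lambda i: importance[i], reverse=True)
    return [sentences[i] for i in order[: max(max_sentences, 0)]]
```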

문장 피드백부(125)는 사전 검증 문장의 적절성 및 중요도 판단모델의 생성 기초가되는 데이터들의 적절성을 판단할 수 있다. The sentence feedback unit 125 may determine the appropriateness of the pre-verified sentence and the appropriateness of data that is a basis for generating the importance judgment model.

문장 피드백부(125)는 상기 사전 검증 문장을 이용하여 산출된 상기 중요도 판단모델의 요약 정확도가 미리 정해진 제1 판단 기준 이상일 경우, 상기 사전 검증 문장 갱신을 요청할 수 있다. The sentence feedback unit 125 may request update of the pre-verification sentence when summary accuracy of the importance judgment model calculated using the pre-verification sentence is equal to or greater than a predetermined first criterion.

구체적으로 설명하자면, 문장 피드백부(125)는 모든 사전 검증 문장의 문서 전문 또는 일부 발췌된 문장들을 이용하여 각각의 중요도 판단모델에 대한 요약 정확도를 산출할 수 있다.Specifically, the sentence feedback unit 125 may calculate summary accuracy for each importance judgment model using the full text of all pre-verified sentences or some extracted sentences.

여기서, 적어도 하나 이상의 요약 정확도가 미리 정해진 제1 판단 기준 이상일 경우, 상기 문장 피드백부(125)는 사전 검증 문장의 갱신을 요청할 수 있다. Here, when the accuracy of at least one summary is greater than or equal to a first predetermined criterion, the sentence feedback unit 125 may request update of the pre-verification sentence.

일례로, 미리 정해진 제1 판단 기준은 95%일 수 있다. For example, the first predetermined criterion may be 95%.

다만, 이에 한정하지 않고, 상기 미리 정해진 제1 판단 기준의 구체적인 수치는 통상의 기술자에게 자명한 수준에서 다양하게 변형 가능하다.However, it is not limited thereto, and specific numerical values of the predetermined first judgment criterion may be variously modified at a level obvious to those skilled in the art.

이는, 유사도 판단의 과적합 가능성이 있기 때문에, 너무 높은 요약 정확도는 오히려 정확한 중요도 판단이 되지 않을 수 있다. This is because the similarity judgment may be overfitted, so an excessively high summary accuracy may actually indicate that importance is not being judged accurately.

문장 피드백부(125)는 상기 사전 검증 문장을 이용하여 산출된 상기 중요도 판단모델의 요약 정확도가 미리 정해진 제2 판단 기준 미만일 경우, 상기 중요도 판단모델의 학습 데이터 점검을 요청할 수 있다. The sentence feedback unit 125 may request to check learning data of the importance determination model when summary accuracy of the importance determination model calculated using the pre-verification sentence is less than a predetermined second determination criterion.

구체적으로 설명하자면, 문장 피드백부(125)는 모든 사전 검증 문장의 문서 전문 또는 일부 발췌된 문장들을 이용하여 각각의 중요도 판단모델에 대한 요약 정확도를 산출할 수 있다. Specifically, the sentence feedback unit 125 may calculate summary accuracy for each importance judgment model using the full text of all pre-verified sentences or some extracted sentences.

여기서, 적어도 하나 이상의 요약 정확도가 미리 정해진 제2 판단 기준 미만일 경우, 상기 문장 피드백부(125)는 중요도 판단모델의 학습 데이터 점검을 요청하는 신호를 문서 정보처리 시스템(100)의 사용자 단말기로 송신할 수 있다. Here, when at least one summary accuracy is less than the predetermined second criterion, the sentence feedback unit 125 may transmit a signal requesting a check of the training data of the importance judgment model to the user terminal of the document information processing system 100.

여기서, 단말기는 일종의 컴퓨팅 장치로서 정보 처리 연산을 처리할 수 있는 장치를 의미할 수 있다.Here, the terminal may refer to a device capable of processing information processing operations as a kind of computing device.

일례로, 단말기는 데스크탑 컴퓨터, 노트북, 스마트폰, PDA(Personal Digital Assistants), PMP(Portable Multimedia Player), 휴대용 단말기 등이 포함되는 이동 단말기 및/또는 스마트 TV 등이 포함될 수 있다.For example, the terminal may include a desktop computer, a laptop computer, a smart phone, a mobile terminal including a personal digital assistant (PDA), a portable multimedia player (PMP), a portable terminal, and/or a smart TV.

다만, 이에 한정하지 않고, 상기 단말기의 종류는 통상의 기술자에게 자명한 수준에서 다양하게 변형 가능하다.However, it is not limited thereto, and the type of terminal can be variously modified at a level obvious to those skilled in the art.

요약 정확도가 너무 낮을 경우, 중요도 판단모델 생성의 기초가 되는 문서 데이터들이 부정확할 가능성이 매우 높다. If the summary accuracy is too low, it is very likely that the document data that is the basis for generating the importance judgment model are inaccurate.

따라서, 문장 피드백부(125)는 이를 피드백함으로써, 중요도 판단모델의 요약 정확도를 향상시킬 수 있다. Accordingly, by feeding this back, the sentence feedback unit 125 may improve the summary accuracy of the importance judgment model.

또한, 문장 중요도 판단부(121)가 판단한 문장 중요도는 다시 문장 모델 생성부(123)로 전달되어, 중요도 판단모델의 기계 학습의 데이터로 활용될 수 있다. In addition, the sentence importance judged by the sentence importance determiner 121 is again transferred to the sentence model generator 123 and used as data for machine learning of the importance decision model.

일례로, 미리 정해진 제2 판단 기준은 60%일 수 있다. For example, the predetermined second criterion may be 60%.

다만, 이에 한정하지 않고, 상기 미리 정해진 제2 판단 기준의 구체적인 수치는 통상의 기술자에게 자명한 수준에서 다양하게 변형 가능하다.However, it is not limited thereto, and specific numerical values of the predetermined second criterion can be variously modified at a level obvious to those skilled in the art.
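The two feedback conditions described above for the sentence feedback unit 125 can be sketched as a simple threshold check. The default values 0.95 and 0.60 are the example criteria given in the text; the names `Feedback` and `feedback_actions` are hypothetical, and how the resulting requests are delivered (for example, to a user terminal) is outside this sketch.

```python
from enum import Enum
from typing import Iterable, Set


class Feedback(Enum):
    UPDATE_PRE_VERIFICATION = "update pre-verification sentences"
    CHECK_TRAINING_DATA = "check training data"


def feedback_actions(
    accuracies: Iterable[float],
    upper: float = 0.95,  # example first criterion from the text
    lower: float = 0.60,  # example second criterion from the text
) -> Set[Feedback]:
    """Collect the feedback requests triggered by per-model accuracies."""
    actions: Set[Feedback] = set()
    for accuracy in accuracies:
        if accuracy >= upper:
            # Suspiciously high accuracy: possible overfitting to the set.
            actions.add(Feedback.UPDATE_PRE_VERIFICATION)
        if accuracy < lower:
            # Very low accuracy: the training data may be unreliable.
            actions.add(Feedback.CHECK_TRAINING_DATA)
    return actions
```

The same check applies to the word feedback unit 135 if selection accuracies and the third and fourth criteria are passed in instead.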

키워드 추출 모듈(130)은 상기 가공 정보의 단어별로 상기 대상 문서와의 유사도를 산출하는 문장 유사도 판단부(131) 및 단어의 중요도를 고려하여 상기 가공 정보로부터 키워드를 추출하는 키워드 추출부(132)를 구비할 수 있다. The keyword extraction module 130 may include a sentence similarity determination unit 131 that calculates, for each word of the processed information, a similarity with the target document, and a keyword extraction unit 132 that extracts keywords from the processed information in consideration of the importance of the words.

또한, 키워드 추출 모듈(130)은 상기 대상 문서 상에서 상기 가공 정보의 유사도를 판단하는 문장 유사도 판단모델을 생성하는 키워드 모델 생성부(133), 사전 검증 문장을 이용하여 상기 문장 유사도 판단모델들의 선별 정확도를 산출하여, 선별 정확도를 기초로 복수의 상기 문장 유사도 판단모델들 중에서 하나의 문장 유사도 판단모델을 선택하는 키워드 모델 선택부(134) 및 상기 사전 검증 문장의 갱신 여부를 판단하는 단어 피드백부(135)를 더 구비할 수 있다. In addition, the keyword extraction module 130 may further include a keyword model generation unit 133 that generates a sentence similarity judgment model for determining the similarity of the processed information to the target document, a keyword model selection unit 134 that calculates the selection accuracy of the sentence similarity judgment models using pre-verification sentences and selects one of the plurality of sentence similarity judgment models based on the selection accuracy, and a word feedback unit 135 that determines whether to update the pre-verification sentences.

또한, 상기 키워드 추출 모듈(130)은 상기 문장 유사도 판단부(131)에서 판단된 단어의 유사도를 공통의 기준으로 정규화하는 키워드 정규화부(136) 및 중요도 태그를 기반으로 문장별 중요도 순위를 책정하는 키워드 순위부(137)를 더 구비할 수 있다. In addition, the keyword extraction module 130 may further include a keyword normalization unit 136 that normalizes, to a common criterion, the word similarities determined by the sentence similarity determination unit 131, and a keyword ranking unit 137 that ranks importance for each sentence based on importance tags.

키워드 모델 생성부(133)는 기계 학습에 의해 문장 유사도 판단모델을 생성할 수 있다. The keyword model generation unit 133 may generate a sentence similarity judgment model through machine learning.

문장 유사도 판단모델은 복수개일 수 있다. There may be a plurality of sentence similarity judgment models.

일례로, 문장 유사도 판단모델은 'SBERT', 'TextRank' 등 일 수 있다.For example, the sentence similarity judgment model may be 'SBERT' or 'TextRank'.

다만, 이에 한정하지 않고, 상기 문장 유사도 판단모델의 종류는 통상의 기술자에게 자명한 수준에서 다양하게 변형 가능하다.However, it is not limited thereto, and the type of the sentence similarity judgment model can be variously modified at a level obvious to those skilled in the art.

키워드 모델 생성부(133)가 문장 유사도 판단모델을 생성하기 위해서 수많은 문서, 문서를 가공한 가공 정보 및 가공 정보 내의 문장별 중요도 랭크들에 대한 데이터들이 활용될 수 있다. In order for the keyword model generation unit 133 to generate a sentence similarity judgment model, a number of documents, processed information obtained by processing documents, and data on importance ranks for each sentence in the processed information may be utilized.

키워드 모델 선택부(134)는 사전 검증 문장을 이용하여 문장 유사도 판단부(131)가 사용할 유사도 판단모델을 선택할 수 있다. The keyword model selection unit 134 may select a similarity judgment model to be used by the sentence similarity determination unit 131 using the pre-verification sentence.

사전 검증 문장은 문서 전문과 해당 문서의 키워드가 하나의 군으로 이루어지며, 복수개의 군들로서 이루어질 수 있다. The pre-verification sentence is composed of a whole document and keywords of the document as one group, and may be composed of a plurality of groups.

여기서, 상기 사전 검증 문장은 문장 형식에 따라 분류될 수 있다. Here, the pre-verification sentences may be classified according to sentence types.

키워드 모델 선택부(134)는 문장 유사도 판단모델을 통해 사전 검증 문장으로부터 소정 기준의 유사도를 가지는 단어들을 추출하고, 그 추출된 단어와 사전 검증 문장 내의 키워드를 비교하여 '선별 정확도'를 산출할 수 있다. The keyword model selection unit 134 may use a sentence similarity judgment model to extract words having a predetermined level of similarity from the pre-verification sentences, and may compare the extracted words with the keywords in the pre-verification sentences to calculate a 'selection accuracy'.

키워드 모델 선택부(134)는 '선별 정확도'가 가장 높은 문장 유사도 판단모델을 선택할 수 있다. The keyword model selector 134 may select a sentence similarity judgment model having the highest 'selection accuracy'.

키워드 모델 선택부(134)는 문서 형식에 따라 상기 사전 검증 문장을 선택할 수 있다. The keyword model selection unit 134 may select the pre-verification sentence according to the document format.

문장 유사도 판단부(131)는 문장 모델 선택부(124)에서 선택된 문장 유사도 판단모델을 이용하여, 가공 정보 내의 키워드가 문서와 얼마나 유사한지를 판단할 수 있다. The sentence similarity determining unit 131 may use the sentence similarity determining model selected by the sentence model selection unit 124 to determine how similar a keyword in processed information is to a document.

여기서, 단어 중요도 판단부는 문장 내의 단어별로 유사도 태그를 부착할 수 있다. Here, the word importance determiner may attach a similarity tag to each word in the sentence.

여기서, 가공 정보 내의 키워드와 문서가 얼마나 유사한지 판단하는데 상술한 유의어 사전이 활용될 수 있다. Here, the thesaurus described above may be used to determine how similar a keyword in processed information is to a document.

일례로, 가공 정보 내의 키워드와 유사한 단어가 문서 내에 존재하는 빈도수를 기준으로 키워드와 문서와의 유사한 정도가 산출될 수 있다. For example, the degree of similarity between the keyword and the document may be calculated based on the frequency of occurrence of words similar to the keyword in the processed information in the document.
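The frequency-based example above can be sketched as follows. The source only states that similarity is calculated from how often words similar to the keyword occur in the document, so the use of relative frequency (matches divided by document length) is an assumption, as are the function name `keyword_document_similarity` and the thesaurus representation as a mapping from a word to its set of synonyms.

```python
from typing import Dict, Sequence, Set


def keyword_document_similarity(
    keyword: str,
    document_tokens: Sequence[str],
    thesaurus: Dict[str, Set[str]],
) -> float:
    """Share of document tokens equal to the keyword or one of its synonyms."""
    if not document_tokens:
        return 0.0
    # The keyword itself counts as a match, together with its thesaurus entries.
    related = {keyword} | set(thesaurus.get(keyword, set()))
    hits = sum(1 for token in document_tokens if token in related)
    return hits / len(document_tokens)
```

For a document tokenized as `["car", "auto", "road", "car"]` and a thesaurus entry `{"car": {"auto", "vehicle"}}`, three of the four tokens match, which gives a similarity of 0.75.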

키워드 정규화부(136)는 상기 단어 중요도 판단부에서 판단된 단어의 유사도를 공통의 기준으로 정규화할 수 있다. The keyword normalization unit 136 may normalize the similarities of the words determined by the word importance determining unit based on a common criterion.

문장 유사도 판단모델은 전체 문서에 대한 단어의 유사도인 문장 유사도를 숫자로 표현하지만, 표현 기준과 범위가 상이할 수 있다. The sentence similarity judgment model expresses the sentence similarity, which is the similarity of words to the entire document, as a number, but the expression standard and range may be different.

이는, 추후 키워드 추출 시에 오차를 발생시킬 수 있다. This may cause errors when extracting keywords later.

이를 예방하기 위해, 키워드 정규화부(136)는 중요도 수치를 일정한 기준으로 정규화하여, 서로 다른 문장 유사도 판단모델을 사용하더라도, 비교 가능한 문장 유사도 값을 얻을 수 있도록 할 수 있다. In order to prevent this, the keyword normalization unit 136 normalizes the importance value based on a predetermined criterion so that a comparable sentence similarity value can be obtained even when different sentence similarity judgment models are used.

키워드 추출부(132)는 문장 유사도 순위가 높은 순서대로, 미리 정해진 개수 내로 단어들을 선택하여 키워드로 선정할 수 있다. The keyword extraction unit 132 selects words within a predetermined number in order of highest sentence similarity ranking and selects them as keywords.

즉, 키워드 추출부(132)는 문장 유사도가 높은 단어 순으로 사용자가 원하는 수의 단어를 추출하여, 키워드로 선정할 수 있다. That is, the keyword extraction unit 132 may extract a number of words desired by the user in the order of words having a high sentence similarity and select them as keywords.

단어 피드백부(135)는 사전 검증 문장의 적절성 및 문장 유사도 판단모델의 생성 기초가되는 데이터들의 적절성을 판단할 수 있다. The word feedback unit 135 may determine the appropriateness of the pre-verified sentence and the appropriateness of data that is a basis for generating a sentence similarity determination model.

단어 피드백부(135)는 상기 사전 검증 문장을 이용하여 산출된 상기 문장 유사도 판단모델의 선별 정확도가 미리 정해진 제3 판단 기준 이상일 경우, 상기 사전 검증 문장 갱신을 요청할 수 있다. The word feedback unit 135 may request update of the pre-verification sentence when the selection accuracy of the sentence similarity judgment model calculated using the pre-verification sentence is equal to or higher than a predetermined third criterion.

구체적으로 설명하자면, 단어 피드백부(135)는 모든 사전 검증 문장의 문서 전문 또는 일부 발췌된 문장들을 이용하여 각각의 문장 유사도 판단모델에 대한 선별 정확도를 산출할 수 있다.Specifically, the word feedback unit 135 may calculate selection accuracy for each sentence similarity judgment model using the full text of all pre-verified sentences or partially extracted sentences.

여기서, 적어도 하나 이상의 선별 정확도가 미리 정해진 제3 판단 기준 이상일 경우, 상기 단어 피드백부(135)는 사전 검증 문장의 갱신을 요청할 수 있다. Here, when the selection accuracy of at least one or more is equal to or higher than the predetermined third criterion, the word feedback unit 135 may request renewal of the pre-verification sentence.

일례로, 미리 정해진 제3 판단 기준은 95%일 수 있다. For example, the predetermined third criterion may be 95%.

다만, 이에 한정하지 않고, 상기 미리 정해진 제3 판단 기준의 구체적인 수치는 통상의 기술자에게 자명한 수준에서 다양하게 변형 가능하다.However, it is not limited thereto, and specific numerical values of the predetermined third criterion can be variously modified at a level obvious to those skilled in the art.

이는, 문장 유사도 판단의 과적합 가능성이 있기 때문에, 너무 높은 요약 정확도는 오히려 정확한 중요도 판단이 되지 않을 수 있다. This is because the sentence similarity judgment may be overfitted, so an excessively high summary accuracy may actually indicate that importance is not being judged accurately.

단어 피드백부(135)는 상기 사전 검증 문장을 이용하여 산출된 상기 문장 유사도 판단모델의 선별 정확도가 미리 정해진 제4 판단 기준 미만일 경우, 상기 문장 유사도 판단모델의 학습 데이터 점검을 요청할 수 있다. When the selection accuracy of the sentence similarity determination model calculated using the pre-verified sentence is less than a predetermined fourth criterion, the word feedback unit 135 may request inspection of learning data of the sentence similarity determination model.

구체적으로 설명하자면, 단어 피드백부(135)는 모든 사전 검증 문장의 문서 전문 또는 일부 발췌된 문장들을 이용하여 각각의 문장 유사도 판단모델에 대한 선별 정확도를 산출할 수 있다. Specifically, the word feedback unit 135 may calculate selection accuracy for each sentence similarity judgment model using the full text of all pre-verified sentences or partially extracted sentences.

여기서, 적어도 하나 이상의 선별 정확도가 미리 정해진 제4 판단 기준 미만일 경우, 상기 단어 피드백부(135)는 문장 유사도 판단모델의 학습 데이터 점검을 요청하는 신호를 문서 정보처리 시스템(100)의 사용자 단말기로 송신할 수 있다. Here, when at least one selection accuracy is less than the predetermined fourth criterion, the word feedback unit 135 may transmit a signal requesting a check of the training data of the sentence similarity judgment model to the user terminal of the document information processing system 100.

선별 정확도가 너무 낮을 경우, 문장 유사도 판단모델 생성의 기초가 되는 문서 데이터들이 부정확할 가능성이 매우 높다. If the selection accuracy is too low, it is very likely that the document data, which is the basis for generating the sentence similarity judgment model, are inaccurate.

따라서, 단어 피드백부(135)는 이를 피드백함으로써, 중요도 판단모델의 요약 정확도를 향상시킬 수 있다. Accordingly, by feeding this back, the word feedback unit 135 may improve the summary accuracy of the importance judgment model.

또한, 단어 중요도 판단부가 판단한 단어의 문장 유사도는 다시 문장 모델 생성부(123)로 전달되어, 중요도 판단모델의 기계 학습의 데이터로 활용될 수 있다. In addition, the sentence similarity of the words determined by the word importance determining unit is transmitted to the sentence model generating unit 123 again, and may be used as data for machine learning of the importance determining model.

일례로, 미리 정해진 제4 판단 기준은 60%일 수 있다. For example, the predetermined fourth criterion may be 60%.

다만, 이에 한정하지 않고, 상기 미리 정해진 제4 판단 기준의 구체적인 수치는 통상의 기술자에게 자명한 수준에서 다양하게 변형 가능하다.However, it is not limited thereto, and specific numerical values of the predetermined fourth criterion can be variously modified at a level obvious to those skilled in the art.

사전 검증 문장은 문장용 사전 검증 문장과 키워드용 사전 검증 문장으로 구분될 수 있다.The pre-verification sentences may be divided into pre-verification sentences for sentences and pre-verification sentences for keywords.

여기서, 문장용 사전 검증 문장은 요약 추출 모듈(120)에서 이용되고, 키워드용 사전 검증 문장은 키워드 추출 모듈(130)에서 이용될 수 있다. Here, the pre-verified sentence for the sentence is used in the summary extraction module 120, and the pre-verified sentence for the keyword can be used in the keyword extraction module 130.

도 3은 본 발명의 일 실시예에 따른 문서 정보처리 시스템이 구비하는 유의어 사전 모듈에 의해 구현되는 과정을 설명하기 위한 도면이고, 도 4는 본 발명의 일 실시예에 따른 문서 정보처리 시스템이 구비하는 요약 추출 모듈에 의해 구현되는 과정을 설명하기 위한 도면이고, 도 5는 본 발명의 일 실시예에 따른 문서 정보처리 시스템이 구비하는 키워드 추출 모듈에 의해 구현되는 과정을 설명하기 위한 도면이다. FIG. 3 is a diagram for explaining a process implemented by the thesaurus module provided in a document information processing system according to an embodiment of the present invention, FIG. 4 is a diagram for explaining a process implemented by the summary extraction module provided in the document information processing system according to an embodiment of the present invention, and FIG. 5 is a diagram for explaining a process implemented by the keyword extraction module provided in the document information processing system according to an embodiment of the present invention.

도 3을 참조하면, 본 발명의 일 실시예에 따른 문서 정보 처리 방법은 문서 정보처리 시스템에 의해 구현되는 문서 정보처리 방법에 있어서, Referring to FIG. 3, the document information processing method according to an embodiment of the present invention is a document information processing method implemented by a document information processing system.

본 발명의 일 실시예에 따른 문서 정보처리 방법은, 가공 모듈에 의해 대상 문서가 가공되어 적어도 하나의 문장으로 이루어진 가공 정보가 취득되는 단계인 취득 단계, 상기 가공 정보로부터 요약문이 추출되는 단계인 요약 추출 단계, 상기 가공 정보로부터 중요 키워드가 추출되는 단계인 키워드 추출 단계 및 상기 가공 정보로부터 추출된 단어인 기준 단어에 대해서 유의어 사전이 자동 생성되는 단계를 포함할 수 있다. A document information processing method according to an embodiment of the present invention includes an acquisition step in which a target document is processed by a processing module and processed information consisting of at least one sentence is acquired, and a summary is extracted from the processed information. It may include an extraction step, a keyword extraction step in which important keywords are extracted from the processed information, and a thesaurus automatically generated for reference words that are words extracted from the processed information.
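The four steps of the method (acquisition, summary extraction, keyword extraction, and thesaurus generation) can be wired together as in the sketch below. The source does not prescribe an interface, so `ProcessedDocument`, `process_document`, and the four injected callables are hypothetical names, and running the last three steps independently on the same processed information is an assumption about ordering.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ProcessedDocument:
    sentences: List[str]             # processed information (at least one sentence)
    summary: List[str]               # output of the summary extraction step
    keywords: List[str]              # output of the keyword extraction step
    thesaurus: Dict[str, List[str]]  # reference word -> similar words


def process_document(
    text: str,
    acquire: Callable[[str], List[str]],
    extract_summary: Callable[[List[str]], List[str]],
    extract_keywords: Callable[[List[str]], List[str]],
    build_thesaurus: Callable[[List[str]], Dict[str, List[str]]],
) -> ProcessedDocument:
    """Run the four steps on one target document."""
    sentences = acquire(text)  # acquisition step (processing module)
    return ProcessedDocument(
        sentences=sentences,
        summary=extract_summary(sentences),
        keywords=extract_keywords(sentences),
        thesaurus=build_thesaurus(sentences),
    )
```

Passing the steps in as callables keeps the sketch independent of any particular summarization, keyword, or word-embedding model.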

도 3은 유의어 사전 모듈에 의해 구현되는 유의어 사전 구축 단계에 대한 순서도로서, 이에 대한 자세한 설명은 상술한 내용과 중복되는 한도에서 생략될 수 있다. FIG. 3 is a flowchart of the thesaurus construction step implemented by the thesaurus module, and a detailed description thereof may be omitted to the extent that it overlaps with the above description.

도 4는 요약 추출 모듈에 의해 구현되는 요약 추출 단계에 대한 순서도로서, 이에 대한 자세한 설명은 상술한 내용과 중복되는 한도에서 생략될 수 있다. FIG. 4 is a flowchart of the summary extraction step implemented by the summary extraction module, and a detailed description thereof may be omitted to the extent that it overlaps with the above description.

도 5는 키워드 추출 모듈에 의해 구현되는 키워드 추출 단계에 대한 순서도로서, 이에 대한 자세한 설명은 상술한 내용과 중복되는 한도에서 생략될 수 있다. FIG. 5 is a flowchart of the keyword extraction step implemented by the keyword extraction module, and a detailed description thereof may be omitted to the extent that it overlaps with the above description.

첨부된 도면은 본 발명의 기술적 사상을 보다 명확하게 표현하기 위해, 본 발명의 기술적 사상과 관련성이 없거나 떨어지는 구성에 대해서는 간략하게 표현하거나 생략하였다.In the accompanying drawings, in order to more clearly express the technical idea of the present invention, components that are not related to or detached from the technical idea of the present invention are briefly expressed or omitted.

상기에서는 본 발명에 따른 실시예를 기준으로 본 발명의 구성과 특징을 설명하였으나 본 발명은 이에 한정되지 않으며, 본 발명의 사상과 범위 내에서 다양하게 변경 또는 변형할 수 있음은 본 발명이 속하는 기술분야의 당업자에게 명백한 것이며, 따라서 이와 같은 변경 또는 변형은 첨부된 특허청구범위에 속함을 밝혀둔다.In the above, the configuration and characteristics of the present invention have been described based on the embodiments according to the present invention, but the present invention is not limited thereto, and various changes or modifications can be made within the spirit and scope of the present invention. It is apparent to those skilled in the art, and therefore such changes or modifications are intended to fall within the scope of the appended claims.

110 : 가공 모듈 120 : 요약 추출 모듈
130 : 키워드 추출 모듈 140 : 유의어 사전 모듈
110: processing module 120: summary extraction module
130: keyword extraction module 140: thesaurus module

Claims

For a search database, in a document information processing system implementing a process of processing documents,
A processing module for obtaining processing information consisting of at least one sentence by processing the target document;
a thesaurus module automatically generating a thesaurus for reference words that are words extracted from the processing information;
a summary extraction module extracting a summary sentence from the processing information; and
A keyword extraction module for extracting important keywords from the processing information;
The thesaurus module,
Includes a pre-processing unit that pre-processes the processing information by separating it into sentence units, replacing Chinese characters with Korean characters, removing individual graphemes, and removing spaces before and after sentences; a tag unit that separates morphemes to generate reference words and attaches tags thereto; a word similarity calculation unit that calculates a degree of similarity between words with respect to the reference words; a dictionary construction unit that adds the reference words to the thesaurus based on the degree of similarity between the reference words; and a word similarity learning unit that generates a word similarity model for converting the reference words with Word2Vec in consideration of the degree of similarity between words;
The word similarity calculation unit converts the reference word into a vector form so that the degree of similarity between words is reflected,
The word similarity learning unit,
Creating the word similarity model through unsupervised learning;
Characterized in that the dictionary construction unit excludes morphemes tagged with designated unused morphemes from among the reference words.
Document information processing system.

delete

According to claim 1,
The dictionary construction unit,
Loading all learned words and generating the thesaurus based on the degree of similarity with the reference word.
Document information processing system.

According to claim 5,
The dictionary construction unit,
Loading and then decoding the entire list of words learned in an encoded state to generate the thesaurus based on information similar to the reference word,
Document information processing system.

delete

According to claim 1,
The summary extraction module,
Comprises a sentence importance determination unit for calculating the importance of each sentence from the processed information, a summary sentence extraction unit for extracting a summary from the processed information in consideration of the importance of the sentences, and a sentence model generation unit for generating an importance judgment model for determining the importance of the processed information with respect to the target document,
The sentence model generation unit,
Generating the importance judgment model by machine learning,
Document information processing system.

According to claim 8,
The summary extraction module,
Further comprising a sentence model selection unit for calculating summary accuracy of the importance judgment models using a pre-verification sentence and selecting one importance judgment model from among a plurality of the importance judgment models based on the summary accuracy;
The pre-verification sentence,
Classified according to document type,
The sentence model selection unit,
Selecting the pre-verification sentence according to the document format,
Document information processing system.

In the document information processing method implemented by the document information processing system,
an acquisition step in which a target document is processed by a processing module to obtain processing information consisting of at least one sentence;
automatically generating a thesaurus for reference words that are words extracted from the processed information;
extracting a summary from the processing information; and
Extracting important keywords from the processing information;
The step of automatically generating the thesaurus,
The processing information is pre-processed by separating it into sentence units, replacing Chinese characters with Korean characters, removing individual graphemes, and removing blanks before and after sentences; morphemes are separated to generate reference words, to which tags are attached; the reference words are converted into vector form to calculate the degree of similarity between words; and the reference words are added to the thesaurus based on the degree of similarity between the reference words, except that a morpheme carrying a designated unused-morpheme tag among the reference words is not added to the thesaurus,
The degree of similarity between the words is calculated through a word similarity model that converts the reference word into Word2Vec,
The word similarity model is generated through unsupervised learning,
Document information processing method.